
Notes on How is ChatGPT's behavior changing over time?

This is a summary of an important research paper, offering roughly a 10:1 time savings over reading the original. It was crafted by humans working with several AIs. The goal is to save time and curate good ideas.


Link to paper: https://arxiv.org/abs/2307.09009

Paper published on: 2023-07-18

Paper's authors: Lingjiao Chen, Matei Zaharia, James Zou

GPT-3 API Cost: $0.02

GPT-4 API Cost: $0.10

Total Cost To Write This: $0.12

Time Savings: 10:1

The ELI5 TLDR:

This research is about two AI language models, GPT-3.5 and GPT-4. The researchers found that these models can change a lot in a short amount of time, so it's important to keep checking how well they work. They tested the models on different tasks, like solving math problems and generating code, and also looked at how the models answered sensitive questions. Performance varied over time: GPT-4 became less willing to answer sensitive questions and made more mistakes in generating code, while GPT-3.5 became more willing to answer sensitive questions but also made more code-formatting mistakes. The researchers found that GPT-4 got better at defending against attempts to trick it into revealing sensitive information. Both models became more verbose in generating code, but the percentage of code that could be directly used decreased. Because these models can change so much, it's important to keep evaluating them. The researchers plan to keep studying the models and have made their data available for other researchers to use.

The Deeper Dive:

Understanding the Temporal Variability in AI Language Models: A Study on GPT-3.5 and GPT-4

This piece delves into the key findings of a recent research paper that puts the spotlight on the temporal behavior of two large language models (LLMs), namely GPT-3.5 and GPT-4. The research underlines the fact that the performance of these models can change significantly over a relatively short span of time, and thus, continuous monitoring of their quality is crucial.

Performance Variability Over Time

The research paper presents an intriguing observation about the performance variability of GPT-3.5 and GPT-4 over time. For instance, GPT-4 (March 2023) demonstrated high accuracy in identifying prime numbers, but this accuracy plummeted by June 2023. Similarly, GPT-3.5 showed an improvement in solving math problems from March to June 2023.
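Accuracy on the prime-identification task can be scored mechanically, since primality has a deterministic ground truth. The sketch below is illustrative only: the scoring convention (a yes/no string answer per number) and the sample queries are assumptions, not the paper's actual query set.

```python
# Sketch of scoring prime-identification accuracy: a deterministic
# primality test supplies ground truth, and model answers are compared
# against it. Queries and answers here are invented for illustration.

def is_prime(n: int) -> bool:
    """Trial-division primality test, used as ground truth."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    d = 3
    while d * d <= n:
        if n % d == 0:
            return False
        d += 2
    return True

def score_answers(queries, model_answers):
    """Fraction of yes/no answers that match the ground truth."""
    correct = sum(
        (ans.strip().lower() == "yes") == is_prime(n)
        for n, ans in zip(queries, model_answers)
    )
    return correct / len(queries)

# Hypothetical failure mode: a snapshot that answers "yes" to everything
# still scores 50% on a balanced query set.
queries = [7919, 7920, 104729, 104730]
answers = ["yes", "yes", "yes", "yes"]
print(score_answers(queries, answers))  # 0.5
```

A degenerate always-"yes" responder scoring well on an unbalanced set is exactly why balanced query sets matter for this kind of monitoring.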

These variations were not just limited to mathematical tasks. The models also exhibited changes in their willingness to answer sensitive questions and their accuracy in code generation. For instance, GPT-4 was less inclined to answer sensitive questions in June than in March, and both models made more formatting mistakes in code generation in June than in March.

Evaluating Performance: Metrics and Tasks

The researchers used specific metrics and tasks to evaluate the performance of these LLMs. The main metrics used for assessing the LLMs' performance in solving math problems were accuracy, verbosity, and answer overlap.

Accuracy was measured by comparing the models' answers to the correct ones. Verbosity was gauged by the length of the models' responses. Answer overlap, on the other hand, was evaluated by comparing the models' answers to each other.
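The three metrics above are simple to compute once responses are collected. A minimal sketch, with function names and toy responses that are assumptions for illustration:

```python
# Toy implementations of the three metrics described above:
# accuracy (match against references), verbosity (mean response length),
# and answer overlap (agreement between two model snapshots).

def accuracy(answers, ground_truth):
    """Fraction of answers matching the reference answers."""
    return sum(a == g for a, g in zip(answers, ground_truth)) / len(answers)

def verbosity(responses):
    """Mean response length in characters."""
    return sum(len(r) for r in responses) / len(responses)

def answer_overlap(answers_a, answers_b):
    """Fraction of queries where two snapshots gave the same answer."""
    return sum(a == b for a, b in zip(answers_a, answers_b)) / len(answers_a)

march = ["yes", "no", "yes", "no"]   # invented March-snapshot answers
june  = ["yes", "yes", "yes", "no"]  # invented June-snapshot answers
truth = ["yes", "no", "no", "no"]

print(accuracy(march, truth))       # 0.75
print(accuracy(june, truth))        # 0.5
print(answer_overlap(march, june))  # 0.75
```

Answer overlap is the interesting one for drift: two snapshots can have similar accuracy yet disagree on many individual queries.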

The tasks used for the evaluation were code generation, puzzle solving, and visual reasoning. For code generation, the researchers looked at the percentage of directly executable generations, the number of characters in the generated code, and the presence of extra non-code text in their generations. For puzzle solving and visual reasoning, they assessed the models' performance based on their accuracy in providing correct answers.
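One way to operationalize "directly executable" is to attempt to compile and run the raw generation and count it as a failure on any error. This is a sketch under that assumption, not the paper's actual harness, and the sample generations are invented:

```python
# Sketch of a "directly executable" check: a generation counts as
# executable if Python can compile and run the raw text without error.

def is_directly_executable(generation: str) -> bool:
    try:
        exec(compile(generation, "<generated>", "exec"), {})
        return True
    except Exception:
        return False

generations = [
    "def add(a, b):\n    return a + b",                   # plain code: runs
    "```python\ndef add(a, b):\n    return a + b\n```",   # fenced: SyntaxError
]
rate = sum(map(is_directly_executable, generations)) / len(generations)
print(f"{rate:.0%} directly executable")  # 50% directly executable
```

Note that the fenced variant fails purely because of the surrounding Markdown text, which is precisely the failure mode the paper observed becoming more common.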

Drift Patterns in Reasoning-heavy Tasks

The paper also discusses the chain-of-thought approach used for reasoning-heavy tasks. This approach prompts the model to lay out its intermediate reasoning steps before committing to a final answer. The researchers observed different drift patterns in GPT-4 and GPT-3.5 using this approach.
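A chain-of-thought prompt is typically just the question plus an instruction to reason step by step. The template below is an assumption for illustration; the exact wording used in the paper is not reproduced here.

```python
# Illustrative chain-of-thought prompt construction. The template text
# is an assumption, not the paper's actual prompt.

def cot_prompt(question: str) -> str:
    return (
        f"Q: {question}\n"
        "A: Let's think step by step, then give the final answer "
        "on its own line as 'Answer: yes' or 'Answer: no'."
    )

print(cot_prompt("Is 17077 a prime number?"))
```

Pinning the final answer to a fixed line format makes the response easy to parse when scoring drift across snapshots.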

In terms of answering sensitive questions, GPT-4 became more conservative, answering fewer questions and generating shorter responses. GPT-3.5, on the other hand, became less conservative, answering more questions and generating slightly longer responses.

Defense Against Jailbreaking Attacks

Jailbreaking attacks are prompts crafted to bypass a model's safety guardrails and elicit responses it would normally refuse, such as sensitive information. The research paper found that GPT-4's update offered a stronger defense against these attacks, with a decrease in the answer rate from 78.0% in March to 31.0% in June. GPT-3.5's defense against jailbreaking attacks did not show a significant drift, with a slight decrease in the answer rate from 100.0% in March to 96.0% in June.

Code Generation Over Time

In the context of code generation, both models became more verbose over time, with an increase in the number of characters in the generated code. However, the percentage of directly executable generations decreased from March to June for both GPT-3.5 and GPT-4. This decline may be attributed to the models adding extra non-code text to their generations.
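Since the executability drop is linked to extra non-code text such as Markdown fences, a small post-processing step can often recover the code. This is a sketch of that idea under the assumption that the non-code text is a triple-backtick fence, not the paper's pipeline:

```python
# Minimal post-processing sketch: extract the body of a ```...``` block
# from a generation before attempting to execute it.
import re

def strip_fences(generation: str) -> str:
    """Return the contents of a fenced block if present, else the input."""
    match = re.search(r"```(?:\w+)?\n(.*?)```", generation, re.DOTALL)
    return match.group(1) if match else generation

fenced = "```python\nprint(2 + 2)\n```"
print(strip_fences(fenced))  # print(2 + 2)
```

Whether such cleanup is acceptable depends on the application; the paper's "directly executable" framing deliberately measures output usable without it.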

The Need for Continuous Evaluation

The research underlines the need for continuous evaluation and assessment of language models in production applications. As the performance of these models can change substantially over a short period, it is essential to monitor their quality continuously to ensure their effectiveness and reliability.

The researchers plan to update their findings in an ongoing long-term study. They have also made the evaluation data and ChatGPT responses available for further research.

The paper also cites several related studies, including a preliminary study on whether ChatGPT is a good translator, an evaluation of GPT-4's logical reasoning ability, and an exploration of GPT-4's capabilities on medical challenge problems, among others.

In conclusion, this research emphasizes the dynamic nature of AI language models and underscores the importance of continuous monitoring and evaluation. The findings serve as a reminder for developers and businesses to stay vigilant and adaptive in their use of AI technologies.