No, GPT-4 Doesn’t Get Worse Over Time (FUD Debunked)


There has been a lot of drama on Twitter about the new Stanford/UC Berkeley collaboration paper titled “How Is ChatGPT’s Behavior Changing over Time?” (source)

The paper’s authors selectively provide examples where newer versions of GPT-4 seem to perform “worse” than older versions and make “formatting mistakes”.

The first evaluation they provide shows a drop in GPT-4’s accuracy from 97.6% to 2.4% on a prime-number task, which seems shocking:

Twitter users ran with it, since the screenshots were too good an opportunity to spread fear, uncertainty, and doubt (FUD) about the model’s performance:

Since then, it has become clear that the research was sloppy. For instance, Princeton CS Professor Arvind Narayanan found that the performance degradation on identifying prime numbers was due to a bias in the test data, not to a degradation of the model (source):
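To see why a biased test set matters, consider a toy sketch (my own illustration; the function names and numbers are hypothetical, not taken from the paper): a model that always answers “prime” looks nearly perfect when every test case happens to be a prime, but a balanced test set immediately exposes that no actual reasoning is happening.

```python
def is_prime(n: int) -> bool:
    """Ground truth by trial division."""
    if n < 2:
        return False
    for d in range(2, int(n ** 0.5) + 1):
        if n % d == 0:
            return False
    return True

def always_yes(n: int) -> bool:
    # Stand-in for a model that answers "prime" regardless of input.
    return True

def score(model, test_set) -> float:
    """Fraction of test cases where the model agrees with ground truth."""
    return sum(model(n) == is_prime(n) for n in test_set) / len(test_set)

primes_only = [n for n in range(2, 1000) if is_prime(n)]  # biased test set
balanced = list(range(2, 1000))                           # mixed test set

print(score(always_yes, primes_only))  # 1.0 -- looks like perfect reasoning
print(score(always_yes, balanced))     # much lower -- the bias is exposed
```

On the biased set, blind mimicry scores perfectly; the “accuracy drop” in a later model version can then simply reflect a change in default answering behavior, not lost capability.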

Regarding the apparent model degradation in the Leetcode example, deep learning researcher Simon Boehm corrected the misperception, pointing out 👇

"June GPT-4 started surrounding code with ```python markdown, which you didn't strip. I forked your code, removed that markdown, and re-submitted the output to Leetcode for judging. Now the June version does significantly better than the march version of GPT-4."

Of course, the researchers knew this and even pointed it out in the paper, but you had to read the whole thing to find it:

So even though the paper presents the evaluations in a way that makes many people conclude that GPT-4 got worse, this is not the case.

OpenAI transparently stated that they continue to change the model using reinforcement learning with human feedback (RLHF), which can positively and negatively impact performance on some tasks (e.g., exams) but improves overall usability.

So the paper’s main contribution, that models change over time, is nothing new. The authors found a few needles in the haystack where GPT-4 performed worse in a later version. Then they prominently highlighted these examples in the prime paper “real estate”, such as the abstract and the first few graphs and charts.

The main takeaway, “GPT-4 output changes over time”, is well known to every regular model user.

Main Paper “Contribution”: “Our findings demonstrate that the behavior of GPT-3.5 and GPT-4 has varied significantly over a relatively short amount of time. This highlights the need to continuously evaluate and assess the behavior of LLMs in production applications.”

For instance, this is what OpenAI stated in their GPT-4 release notes:

OpenAI also stated publicly that “users can apply [Evals] for tracking performance across model versions (which will now be coming out regularly)”:

Clearly, every practitioner knows that the model output may change over time and that they need to check and possibly reformat the output of a large language model because it’s not 100% predictable, much like the reply of a human being.
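Both habits, tracking behavior across model versions and normalizing output before judging it, can be combined in a tiny regression harness. The following is a hypothetical sketch, not the actual OpenAI Evals API; `call_model` is a stub standing in for a real API call, and the canned answers only illustrate the formatting change discussed above:

```python
def call_model(version: str, prompt: str) -> str:
    # Placeholder: in practice this would query the model API for `version`.
    canned = {
        ("2023-03", "Is 17077 prime? Answer Yes or No."): "Yes",
        ("2023-06", "Is 17077 prime? Answer Yes or No."): "```Yes```",
    }
    return canned.get((version, prompt), "No")

def accuracy(version: str, test_set: list[tuple[str, str]]) -> float:
    correct = 0
    for prompt, expected in test_set:
        # Normalize formatting before judging, so cosmetic changes
        # (like added markdown fences) are not scored as capability loss.
        answer = call_model(version, prompt).strip("` \n")
        correct += answer == expected
    return correct / len(test_set)

test_set = [("Is 17077 prime? Answer Yes or No.", "Yes")]
for version in ("2023-03", "2023-06"):
    print(version, accuracy(version, test_set))  # both score 1.0 after normalization
```

Run against a fixed test set on every new model version, a harness like this separates real regressions from mere formatting drift.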

The paper, however, doesn’t provide a significant scientific contribution, even though the researchers are well-affiliated (UC Berkeley and Stanford) and have a strong record of scientific impact. I greatly respect the researchers and have relied on their work for my doctoral thesis on graph partitioning.

But if you ask me, that’s a trivial contribution and not really worthy of publication in a first-class conference or journal. I don’t believe this paper has really expanded the body of human knowledge. It just states the obvious but does it in a way that risks giving the wrong impression to the non-scientific community.

💡 TLDR: GPT-4, like any large language model provided by OpenAI, may change over time. Its overall performance has not been getting worse over time.

Let’s conclude this article with a great perspective provided by Princeton CS Professor Arvind Narayanan:

“We dug into a paper that’s been misinterpreted as saying GPT-4 has gotten worse. The paper shows behavior change, not capability decrease. And there’s a problem with the evaluation — on 1 task, we think the authors mistook mimicry for reasoning.” – Arvind Narayanan