Why can't we trust ChatGPT?
New research reveals that the AI models behind ChatGPT have exhibited extreme instability, which may affect future use and monetization
A study published last week by researchers at Stanford and UC Berkeley revealed instability in the outputs of GPT-4, the latest generative artificial intelligence model behind OpenAI's ChatGPT. The study documented significant changes in GPT-4's performance over a period of just three months, particularly on relatively simple tasks. Most notably, the model's accuracy in identifying prime numbers plummeted from 97.6% in March to a mere 2.4% in June. Surprisingly, GPT-3.5, the model that powers the free public version, actually improved on the same task.
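In principle, the prime-number benchmark is easy to reproduce: ask the model whether each number in a list is prime and score its yes/no answers against a real primality test. The Python sketch below illustrates only that idea; the ask_model function is a hypothetical stand-in for a call to whichever chat API is being tested, and the prompt wording and test numbers are illustrative rather than the study's own.

```python
# Minimal sketch of a primality-accuracy check. ask_model() is a hypothetical
# placeholder for the chat model under test, not part of the published study.
from sympy import isprime


def ask_model(prompt: str) -> str:
    # Replace with a real API call to the model being evaluated.
    return "no"  # dummy answer so the sketch runs end to end


def prime_accuracy(numbers: list[int]) -> float:
    """Fraction of numbers where the model's yes/no answer matches sympy's isprime."""
    correct = 0
    for n in numbers:
        reply = ask_model(f"Is {n} a prime number? Answer yes or no.")
        model_says_prime = reply.strip().lower().startswith("yes")
        if model_says_prime == isprime(n):
            correct += 1
    return correct / len(numbers)


if __name__ == "__main__":
    # Illustrative inputs only; the study used its own fixed set of questions.
    print(prime_accuracy([97, 100, 7919, 7920]))
```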
OpenAI acknowledged the research and said it was aware of the reported regressions. Logan Kilpatrick, the company's head of developer relations, tweeted that their team was actively investigating the matter.
The study, posted as a preprint on arXiv, compared the performance of GPT-3.5 and GPT-4 in March and June. It came in response to subjective complaints from users about a decline in performance. Speculation emerged that OpenAI had intentionally reduced GPT-4's capabilities, either to cut costs and increase profits or to steer its output toward political correctness. However, a company VP denied such claims, stating that newer versions are designed to be smarter than their predecessors and that issues may simply become more apparent with heavy usage.
The study examined four tasks: solving math problems, answering sensitive questions, generating code, and visual reasoning. The models showed significant variability on all of them within less than three months. For instance, the older GPT-3.5 became far more accurate at identifying prime numbers (from 7.4% in March to 86.8% in June) but still struggled with more complex code generation. Meanwhile, only 10% of the code generated by the June version of GPT-4 ran as instructed, whereas roughly 50% of the March version's code was executable.
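The "executable code" figure comes down to a mechanical check: run each generated program and count how many finish without an error. The sketch below shows one way such a check could look; generate_code is a hypothetical stand-in for the model call, and the snippet illustrates the metric rather than reproducing the study's actual evaluation harness. Model-generated code should of course only ever be executed in a sandboxed environment.

```python
# Rough sketch of a "does the generated code run?" metric. generate_code() is a
# hypothetical placeholder for the model being evaluated, not the study's code.
import subprocess
import sys
import tempfile


def generate_code(task: str) -> str:
    # Replace with a real call to the code-generating model.
    return "print('hello')"  # dummy output so the sketch runs end to end


def executable_fraction(tasks: list[str], timeout: int = 10) -> float:
    """Share of tasks whose generated snippet exits cleanly when executed."""
    ok = 0
    for task in tasks:
        snippet = generate_code(task)
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(snippet)
            path = f.name
        try:
            result = subprocess.run([sys.executable, path],
                                    capture_output=True, timeout=timeout)
            if result.returncode == 0:
                ok += 1
        except subprocess.TimeoutExpired:
            pass  # treat hangs as failures
    return ok / len(tasks)


if __name__ == "__main__":
    # Illustrative task list only.
    print(executable_fraction(["write a function that reverses a string"]))
```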
GPT-4 also had more difficulty answering consecutive questions accurately in June than in March, while GPT-3.5 improved. The length of the newer model's answers decreased significantly from March to June, whereas GPT-3.5's answers grew about 40% longer over the same period.
When faced with sensitive or intentionally misleading questions, both models became more resistant to answering. GPT-4 provided direct answers to such questions 21% of the time in March but only 5% of the time in June; GPT-3.5 saw a decrease from 8% to 2% over the same period.
The study raised concerns about OpenAI's lack of transparency in how it updates and deploys its models. Updates and adjustments may address specific issues, but they can also introduce greater variation elsewhere. The researchers did not offer an explanation for the observed changes, but a separate study from the University of Oxford suggested that training models on internet data that is itself increasingly AI-generated, in effect recycling their own earlier mistakes, could ultimately lead to incoherent output.
Despite some misunderstandings and reservations about the study, critics and supporters alike agree that there is a fundamental problem with OpenAI's approach to deploying and updating its models: without transparency or open-source code, it is difficult to build reliable software on top of a platform that changes without documentation. This matters because OpenAI and its major investor, Microsoft, are currently seeking profit models for these products. Microsoft recently announced a subscription plan for access to generative AI features within its Office tools, while OpenAI plans to sell access to its models and let companies build their own applications on top of them.