We Thought ChatGPT Was Great for Programming. A New Study Finds That Half of Its Answers Are Wrong

OpenAI’s AI chatbot continues to grow in popularity.
However, a new study finds that 52% of its computer programming answers contain incorrect information.

May 28, 2024, 12:26

Updated June 4, 2025, 12:45 ET

Javier Márquez

Writer

Looking for solutions on Stack Overflow or searching on Google isn’t as trendy as it used to be. Many programmers are turning to ChatGPT as a useful tool to improve their workflow and reduce their reliance on these platforms. However, OpenAI’s AI chatbot isn’t flawless, so depending on it entirely may not be the best approach.

Like any other tool based on large language models (LLMs), ChatGPT has its limitations. The company, led by Sam Altman, acknowledges on its website that its chatbot can make mistakes and encourages users to fact-check important information. So, how well does it perform in the programming world? Let’s see what some researchers have to say.

When More Than 50% of the Answers Are Wrong

A team of researchers at Purdue University recently conducted a study in light of the “rising popularity of ChatGPT” and the tendency of LLMs to generate “fabricated texts” that aren’t easy to recognize. While many responses may seem plausible, they can also be incorrect.

“Our analysis shows that 52% of ChatGPT answers contain incorrect information,” the study reads. It adds that 77% of the responses are more detailed than human responses (which doesn’t guarantee their accuracy). Additionally, 78% of these answers suffer from varying degrees of inconsistency. These figures are hard to ignore.

To obtain these values, the researchers took 517 programming questions from Stack Overflow. They then examined the correctness, consistency, comprehensiveness, and conciseness of the answers provided by the GPT-3.5-based version of ChatGPT. They also conducted a large-scale linguistic analysis and a user study to understand ChatGPT answers from different points of view.

The research team decided to use the most widely used free version of the chatbot, based on GPT-3.5, instead of GPT-4, the most recent version of the language model available at the time of the study. It’s important to mention that they also ran tests with GPT-4 and found that while the newer model performs “slightly better,” both models have a high inaccuracy rate.

It’s worth remembering that ChatGPT is a general-purpose AI chatbot that can be used for many different tasks, including writing a letter. In programming, there are other AI-powered tools tailored specifically for developers, like GitHub Copilot, which integrates with development environments.
