o1 Doesn’t ‘Reason’ and Isn’t an AGI, but It Does Address the Major Issue With Chatbots: Hallucinations

  • This new artificial intelligence model utilizes a simple technique to make fewer mistakes: It reviews what it says.

  • Despite this, its applications are very interesting and indicate a future in which chatbots will be fast without making as many mistakes.


Let’s count the letter “R,” shall we?

When you ask GPT-4o how many “R’s” are in the word “strawberry,” it’ll mistakenly tell you that there are two. The error occurs because the model doesn’t process text the way humans do: it splits it into tokens, so it never actually “sees” the individual letters it’s supposed to count.
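If you want to see the token issue for yourself, OpenAI’s open source tiktoken library exposes the tokenizers its models use. The sketch below assumes a recent tiktoken release that ships the o200k_base encoding used by GPT-4o; the exact split can vary, but the point is that the model receives a few multi-character chunks rather than a string of letters:

```python
# A quick way to see what GPT-4o actually "reads": token IDs, not letters.
# Requires `pip install tiktoken`; the exact split depends on the tokenizer.
import tiktoken

enc = tiktoken.get_encoding("o200k_base")   # encoding family used by GPT-4o
tokens = enc.encode("strawberry")
print(tokens)                               # a short list of integer token IDs
print([enc.decode([t]) for t in tokens])    # multi-character chunks, not single letters
```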

Surprisingly, even powerful models, like Anthropic’s Claude, make the same mistake. You could make fun of these models for failing to solve such a simple problem while being great in other areas. However, the reality is that they’re not designed to handle this type of task.

Claude gets it wrong.

In fact, when you test the basic, free version of ChatGPT, the answer appears to be correct, but that’s because the chatbot cheats: it writes a small program to count the “R’s” instead of counting them itself. This behavior likely changed after several users downvoted the wrong answer. Believe me, it didn’t do that originally.
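We don’t know the exact script ChatGPT generates on any given run, but the trick it pulls is essentially something like this trivial piece of Python (an illustrative guess, not the code it actually emits):

```python
# Roughly the kind of throwaway script a chatbot can generate to sidestep
# the tokenization problem and count characters directly.
word = "strawberry"
count = word.lower().count("r")
print(f"There are {count} 'r's in '{word}'")   # prints: There are 3 'r's in 'strawberry'
```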

That’s where the “magic” of o1, OpenAI’s new model, comes in. At the moment, only preliminary versions of the model are available, but even these versions are a significant advance over other chatbots and models like GPT-4o because they simply make fewer mistakes.

In one of the o1 demo videos, OpenAI researcher Hyung Won Chung used the example of the “R’s” in the word “strawberry.” The o1 model was able to give the correct answer quickly. According to Chung:

“Having reasoning [capability] built-in can help avoid mistakes because it can look at its own output, review it, and be more careful.”

Chung’s statement is important but misleading. The o1 model doesn’t “reason,” at least not in the “human” sense of the word. The model still doesn’t know what it’s saying. However, as he rightly points out at the end, it does do something important: it reviews its own output.

Review is the key feature of o1, a model that takes longer to respond because it works as fast as GPT-4o (or maybe faster) in coming up with a solution but doesn’t offer it to the user directly. Instead, it reviews it. If it finds any mistakes, it iterates back on itself, correcting the error, re-proposing a solution, and reviewing it until it detects (or believes) there are no mistakes.
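OpenAI hasn’t published how o1 implements this loop internally, so take the following as a conceptual sketch rather than a description of the real system. The `propose` and `review` functions are hypothetical stand-ins for calls to the model itself:

```python
# Conceptual sketch only: `propose` and `review` are hypothetical helpers
# standing in for model calls, not a real OpenAI API.

def propose(problem: str, feedback: str | None = None) -> str:
    """Draft (or redraft) a candidate answer, optionally guided by review feedback."""
    raise NotImplementedError  # a model call would go here

def review(problem: str, answer: str) -> tuple[bool, str]:
    """Check the candidate answer; return (looks_correct, feedback)."""
    raise NotImplementedError  # a model call would go here

def answer_with_review(problem: str, max_rounds: int = 5) -> str:
    answer = propose(problem)
    for _ in range(max_rounds):
        ok, feedback = review(problem, answer)
        if ok:                                   # no mistakes detected (or so the reviewer believes)
            break
        answer = propose(problem, feedback)      # try again, guided by the critique
    return answer                                # only now is the answer shown to the user
```

The key design choice is that nothing leaves the loop until the reviewer stops finding problems (or the rounds run out), which is exactly why o1 feels slower than GPT-4o.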

This iterative trial-and-error process seems to be the basis of o1, as the demonstration video above shows. Research lead Jerry Tworek uses a logic puzzle where the AI model needs to find out how old a prince and a princess are. With its first responses, o1 detects the variables and equations, then reviews them, solves the problem, and finally validates the solution.

The final answer reflects this way of processing the information: o1 works through the problem but only presents a solution after checking that everything holds up. Unlike its predecessors, it doesn’t answer “out of the blue.”
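The article doesn’t reproduce the exact riddle from the demo, but the “variables, equations, solve, verify” routine it describes is easy to illustrate with a made-up age puzzle and the sympy library:

```python
# Hypothetical riddle (not the one from the demo): the princess is 3 years older
# than the prince, and in 2 years she will be twice as old as the prince was
# 2 years ago. Requires `pip install sympy`.
from sympy import symbols, Eq, solve

prince, princess = symbols("prince princess", positive=True)

equations = [
    Eq(princess, prince + 3),
    Eq(princess + 2, 2 * (prince - 2)),
]

solution = solve(equations, (prince, princess), dict=True)[0]
print(solution)  # {prince: 9, princess: 12}

# Validation step, mirroring how o1 is described as checking its own work.
assert solution[princess] + 2 == 2 * (solution[prince] - 2)
```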

While o1’s clear, assertive tone is similar to that of its predecessors, it doesn’t try to convince you that its first draft is the right one. Users know that the model has checked what it says before showing it to them.

Is this a revolution? I wouldn’t say so. As expert Gary Marcus said in an X post, “It’s not AGI, or even close.” But it’s an interesting leap forward in reducing errors and hallucinations, and it may be especially useful in scenarios where waiting a bit longer is acceptable if it means fewer errors and a built-in mechanism for fixing them.

For instance, developers have widely used generative AI in the programming world. However, ChatGPT fails more than it should in that area. Improving it will simplify programmers’ lives even more.

The teams at GitHub and Cognition (the company behind Devin) commented on this, explaining that o1 represents an important qualitative leap. For example, they asked a version of Devin powered by o1 to analyze an X post using the machine learning libraries textblob and text2emotion.

While working on the task, Devin ran into the following message: “AttributeError: module ‘emoji’ has no attribute ‘UNICODE_EMOJI’.” When GPT-4o attempted to resolve that exception, it got confused because the problem wasn’t in the code itself but in the installed version of the emoji library. The o1-preview model they used “came to the right conclusion by researching online like a human engineer would.”
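For context, that AttributeError is a known versioning problem: the emoji package dropped UNICODE_EMOJI in its 2.x releases, while text2emotion still expects the old attribute. The check below, and the commonly reported workaround of pinning an older emoji release, are assumptions based on public issue reports, not part of the Devin write-up:

```python
# Sketch of the failure mode, not the Devin team's code: text2emotion relies on
# emoji.UNICODE_EMOJI, which the emoji package removed in its 2.x releases.
from importlib.metadata import version

import emoji

print(version("emoji"))
if not hasattr(emoji, "UNICODE_EMOJI"):
    # The calling code isn't wrong; the installed dependency is too new. Users
    # commonly work around it by pinning an older release,
    # e.g. `pip install "emoji<2.0"`.
    print("emoji >= 2.0 detected: UNICODE_EMOJI is gone and text2emotion will fail")
```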

Does this guarantee that o1 won’t make mistakes? Not at all. The model still makes errors. For example, an X user demonstrated that o1 made a mistake in a tic-tac-toe game. OpenAI CEO Sam Altman himself cautioned about this when he announced it on X, stating that “o1 is still flawed, still limited, and it still seems more impressive on first use than it does after you spend more time with it.”

Nevertheless, it’s a significant advancement in situations where accuracy is more crucial than speed. In the future, there will probably be models that can analyze and respond almost instantaneously. This is when GPT-4o’s voice capabilities will be even more remarkable.

On a side note, the new voice synthesis features are still unavailable even for paid users. When asked about it, Altman replied rather arrogantly, “How about a couple of weeks of gratitude for magic intelligence in the sky, and then you can have more toys soon?”

Image | OpenAI

Related | Copilot+ Explained: What It Is, What New Functions It Adds to Windows 11, and What You Need to Use It
