I spent several hours testing Grok 3, the latest version of xAI’s AI model. My goal was to evaluate its capabilities and performance compared to other models, such as ChatGPT, Claude, Le Chat, and DeepSeek.
Reasoning and Problem-Solving
- Grok 3 excels at math problems. I had it attempt the 2024 AIME challenge, and it successfully solved six out of the 15 problems. In comparison, OpenAI’s o3-mini-high version solved only nine. Additionally, OpenAI’s model took almost six minutes to complete the problems, while Grok 3 took just under five minutes. Seeing Grok’s self-assessment processes while determining the correct answers was striking, although it sometimes missed the mark.

- Basic reasoning tests included counting the number of repeated letters in complex words (like the classic “Lollapalooza”) and comparing decimal values (for instance, 9.11 versus 9.9). Grok 3 provided correct answers in these tests after a few seconds of apparent “reasoning.”

- In a question about Greek mythology, I asked the chatbots who Jason’s maternal great-grandfather was. Grok 3 found the correct answer in 18 seconds, while o3-mini-high took 22 seconds and ultimately failed. Well played, Grok.
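
The letter-counting and decimal-comparison checks above have ground-truth answers that are easy to verify by hand or in a few lines of code. Here is a minimal Python sketch of how those answers could be confirmed. The helper names `repeated_letters` and `larger_decimal` are my own, chosen for illustration, and are not part of any chatbot's tooling.

```python
from collections import Counter
from decimal import Decimal


def repeated_letters(word: str) -> dict[str, int]:
    """Return each letter that appears more than once, with its count (case-insensitive)."""
    counts = Counter(ch for ch in word.lower() if ch.isalpha())
    return {letter: n for letter, n in counts.items() if n > 1}


def larger_decimal(a: str, b: str) -> str:
    """Return whichever of two decimal strings is numerically larger."""
    # Decimal compares by numeric value, so "9.9" beats "9.11"
    # even though "9.11" has more digits.
    return a if Decimal(a) > Decimal(b) else b


print(repeated_letters("Lollapalooza"))  # {'l': 4, 'o': 3, 'a': 3}
print(larger_decimal("9.11", "9.9"))     # 9.9
```

Models tend to stumble on these prompts because they look trivial: 9.11 has more digits than 9.9, so it can appear larger at a glance.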
Search and Synthesis Capabilities
- Grok 3’s DeepSearch function is fast, but it’s sometimes inaccurate and can overlook important details. For example, when I asked it to analyze the impact of AI on chip design, it generated a 1,504-word text with several citations in just over a minute. However, it failed to mention significant advancements such as Google’s AlphaChip framework. On subsequent attempts, it eventually included that information.
- I also requested a comprehensive report on Xataka, covering financial, media, and reputational aspects. While it was quite accurate, it exhibited a limitation common to every deep research system: it’s well-versed in publicly available information but lacks deeper insight. It doesn’t have the judgment of an expert who understands both what’s publicly known and the underlying factors. When you seek information about a topic you’re unfamiliar with, it’s easy to assume that the deep research feature provides a complete picture. The limitations become apparent only when you know the subject firsthand.

- The speed is impressive. Grok 3’s DeepSearch is noticeably faster than OpenAI’s Deep Research, but that speed comes at the cost of depth. Still, its selection of sources and citations is usually quite good.
- Unlike Gemini, Grok 3 doesn’t allow users to export reports directly to documents or customize the research approach. While Grok is very smart and capable, it lacks robust features. A strong language model is of little use if it requires users to start from scratch and process all the information manually.
Creativity and Tone
- I asked for a story about a time traveler facing a paradox to test Grok 3’s creative writing abilities. The result was solid in character building, details, descriptions, and atmosphere, surpassing what I consider the best in this area, Claude 3.5 Sonnet. However, some plot twists felt forced.

- Its humor tends to be basic and predictable, primarily relying on obvious puns and adolescent humor. If the “uncanny valley” concept can apply to a chatbot, Grok 3 is at the 99% mark. Its humor is too refined and too predictable at the same time.
- Grok 3 maintains political neutrality on contentious issues like immigration or trans rights. Musk claims it can be politically incorrect, but this depends more on user input than on an inherent personality trait. In other words, it can venture into politically incorrect territory only if the user prompts it to do so.
Grok 3’s Limitations
- Unlike ChatGPT, Grok doesn’t let users customize the model’s behavior, and it doesn’t offer selectable response styles as Claude does.
- The interface is confined to a text box and includes only basic buttons for attaching files, activating DeepSearch, and enabling reasoning mode. Unlike with Claude, ChatGPT, and Le Chat, users must always start with a blank slate, so Grok can’t retain context or established guidelines that could streamline your work.

- Grok 3 has stricter security guardrails compared to Grok 2. While the previous version was notable for its lack of restrictions, Grok 3 has implemented measures to limit inappropriate content generation. For example, it refused to help create a template for a mass email fraud campaign claiming to be a prince seeking an heiress.
- When it comes to image generation, Grok 3 appears to be more lenient. Unlike Midjourney, which prohibits the use of terms like “Donald Trump” or “President of the United States,” Grok 3 is less restrictive, even regarding owner Elon Musk.

You can try Grok 3 on its official website or through its integration with X. It’s currently free, but it’s expected to become part of a paid subscription service for X, which is likely to be expensive.
Although Grok 3 has undeniable capabilities, several similar alternatives exist, and being slightly smarter or faster may not be enough to stand out. The true differentiator lies in the product itself, and this is where Grok 3 has significant room for improvement.