Grok 4 is Here and it’s Simply Brilliant!


“It’s smarter than almost all graduate students in all disciplines – Elon Musk.”

Elon Musk and his Grok team are back with their latest and best model to date: Grok 4. It was only 3 months ago that this team of experts launched Grok 3, a model that still competes with the giants from OpenAI, Gemini, and Anthropic. But with Grok 4, Elon Musk is giving these companies a run for their money. Grok 4 comes with superhuman-level thinking and reasoning capabilities. With tools and agents in its arsenal, it brings a better understanding of the world, both personal and professional. In this blog, we’ll explore everything about Grok 4: its features, capabilities, benchmarks, and finally, we’ll test it.

Let’s Grok it!

What is Grok 4?

Grok 4 is the latest multi-modal large language model (LLM) from Elon Musk’s company, x.ai. It has 100 times more training data than Grok 2 (the first public model by x.ai) and 10 times more reinforcement learning compute than any other model available. Grok 4 features a 256K context window, real-time data search, advanced voice capabilities, agentic abilities, and intelligence that closely mimics human behavior.

Grok 4 comes in two different versions:

  • Normal Version: This is the single-agent version of the Grok 4 LLM. It features agentic behavior, where one agent works to solve your problems. This model is useful for daily tasks involving language, search, coding, and more. It’s available in the Super Grok plan offered by x.ai and also via API for developers.
  • Grok 4 Heavy: This is the multi-agent version of Grok 4. When prompted, multiple agents collaborate, compare outcomes, and generate the best result. It’s ideal for complex reasoning, deep analysis, and research. It is available only under the Super Grok Heavy plan by x.ai.

Key Features

  • It’s an Academic Whiz: Grok 4 shines on the Humanity’s Last Exam (HLE) benchmark. Out of 2,500 questions spanning math, physics, chemistry, humanities, and computer science, it scored double digits on half! Most current models manage only low single digits, suggesting Grok 4 can tackle PhD-level problems across disciplines.
  • Tool Use: Grok 4 has been trained natively on tool use, outperforming Grok 3’s research tools. With extensive scaling and compute, it can handle even the toughest text-based problems.
  • Its design is Agentic: The Grok 4 models are agentic. With single and multiple agents working behind the scenes, these models can swiftly perform multiple tasks. 
  • Its enhanced voice capabilities: The Grok 4 models come with an advanced voice mode that sounds more personal and calm compared to the other models from Open AI and Gemini. It comes with a new voice, “Eve” – a British speaker that can quickly switch from singing to whispering, mimicking human-like emotions.  Along with this, the latency of their latest voice mode has been reduced by half, compared to its previous version.
  • It can run a business: The Grok 4 models can reason out like humans and take decisive decisions, strategise, and plan in a way that makes them capable of running a business. Infact, they might just help you make some profit too. 

When it comes to multimodal capabilities, especially image analysis and generation, Grok 4 models currently perform poorer than the top models like o3, Gemini 2.4 Pro, Claude 4, etc. Although this may improve significantly in the coming few days (or weeks).

Availability

Grok 4 Availability
Source: X
  • Super Grok: Includes Grok 4 and Grok 3. Comes with a 128K token window, voice and vision capabilities. Priced at $30/month or $300/year.
  • Super Grok Heavy: Includes Grok 4 Heavy and Grok 4. Offers an enhanced context window and early access to new features. This premium plan costs $300/month or $3,000/year, comparable to OpenAI’s and Google’s premium tiers.

How to Access Grok 4?

To access Grok 4 on chat:

  1. Head to Grok
  2. Log in to your Super Grok account.
  3. In the chatbox in the middle of the screen and click on the small model dropdown at the corner of the chatbox. 
  4. Select the “Grok 4” model
How to Access Grok 4?
Source: Grok
  1. Once done, you can get started.

 To access Grok 4 on the API:

  1. Go to https://x.ai/api and click on API Console Login.
  2. Click on API Keys.
  3. Click on Create API key and after that give a name to your api key and click on Save to generate your grok api key.
  4. Now to access the Grok 4 using api endpoints, visit https://docs.x.ai/docs/models/grok-4-0709 and use the below code snippet to access it.
from xai_sdk import Client

from xai_sdk.chat import user, system

client = Client(

    api_host="api.x.ai",

    api_key=""

)
chat = client.chat.create(model="grok-4-0709", temperature=0)

chat.append(system("You are a PhD-level mathematician."))

chat.append(user("What is 2 + 2?"))

response = chat.sample()

print(response.content)

Grok 4 in Action

Now that we’ve read all about Grok 4, it’s time to see if it brings in the punch as it claims. To do this, we will test Grok 4 on the following tasks:

  1. PhD-level Question to test their reasoning capabilities
  2. Multi-step research to check its agentic capabilities
  3. Coding with context to test its real-world use capabilities

Let’s start. 

Task 1: Solving a PhD-level Question

Result:

Analysis:

Grok 4 approached the problem step-by-step, addressing each question in order. It correctly interpreted the prompt, reasoned through the solution, and even generated code for the graphs when asked. The visualizations were accurate and aligned with the explanation.

Task 2: Performing a Multistep Research

Prompt: “Tell me about Analytics Vidhya’s latest post on X and find the latest blog on their website – summarise information on them in 5 lines each.

Result:

Analysis:

This task it performed better than I had imagined. The task itself is not difficult, but I see so many models struggling with the dates to accurately fetch the latest information. Grok 4 took only a few seconds. It went through the website and the Twitter page, found the latest information, and then reasoned it out to give me 5 concrete lines on each. 

You can check it yourself on our blog page or X page. 

Task 3: Doing Coding with Context

Prompt: “Merge all these PDFs and create a single JSON file.”

Files

Result:

Doing Coding with Context using Grok 4

Analysis:

It started well, by listing down the content from a few files, and then began the hallucinations. All that I got in the result was a stream of #. So this was disappointing

Prompt 2: “Convert the following code into Python and React

Code File

Result:

Analysis:

Grok 4 was quick and pretty efficient, it quickly generated the code in Python and actually understood that with the “react” word in my prompt. I was looking forward to seeing the code for my app’s frontend. It then also presented the code for each section, making it simple for me to copy the required part as and when it is needed. 

Grok 4 Benchmarks

Grok 4 almost aced all of the benchmarks that we usually look at. Here is a summary:

Benchmarks - Grok 4
Source: X
  1. GPQA (Graduate-Level Physics Questions Archive): This benchmark test expert expert-level science knowledge. On this benchmark, Grok 4 achieves 87-88%, leading competitors like GPT-4o and Claude 3.5 Sonnet.
  2. AIME (American Invitational Mathematics Examination) 2025: This benchmark compares the mathematical prowess. Grok 4 scores 95%, with some reports claiming up to 100% dominance. This surpasses previous SOTA models.
  3. SWE-Bench (Software Engineering Benchmark): It evaluates coding and real-world software problem-solving (Grok 4 Code variant). Scores range from 72-75%, significantly ahead of o3-mini (high) and Claude 3.5 Sonnet.
  4. Other Math and Reasoning Benchmarks: Grok 4 dominates U.S. Mathematical Olympiad and Harvard-MIT Mathematics Tournament, and similar tests with massive gains over prior SOTA. It also excels in general reasoning and Ph.D.-level tasks across fields.

These are the usual benchmarks for testing any latest LLM. Grok 4 also came with its scorecard on two new benchmarks: ARC-AGI and Vending Bench.

ARC-AGI

This benchmark checks how close models are to achieving AGI, or artificial general intelligence. This is done by scoring their performance on different ARC-style tasks, which are a collection of challenging puzzles.

Arc - agi
Source: X

Grok 4 takes up the top spot, breaking the 10% barrier, meaning the model has taken its first steps into general reasoning. Claude Opus 4 models follow next and then come o3 (high), o4-mini(high), and others! This seems that Grok 4 is essentially closer to AGI than the rest of its peers. 

Vending Bench

This benchmark tests the agentic AI systems to measure how well these agents can interact with a real e-commerce website to complete complex tasks.  It’s designed to stress test real-world decision making, planning, and UI interaction. 

Grok 4 excels in this too, beating some human, Claude 4, Opus, and Gemini 2.5 Pro and o3. 

Vending Bench - Grok 4
Source: X

Infact, the Grok 4 was tested to run an actual vending machine to test this, and it incurred huge profits while doing so. Anthropic had released something similar about Claude running a vending machine a few days back, and in that, they had mentioned that the machine ran into a loss!

Applications of Grok 4

Grok 4 comes with a great set of features and performance benchmarks, based on which it can be pretty useful for:

  1. Real-Time Social Media Interaction: It is integrated directly into X (formerly Twitter) as a chatbot. It can be used to generate memes, posts, polls, summaries, or sentiment analysis.
  2. Advanced Research: It can solve PhD-level questions, thus indicating that it can truly contribute to advanced research in mathematics, physics, and engineering.
  3. Business Planning: It can help to map out strategies and perform advanced business analysis to help you get actionable insights. 
  4. Coding and Writing: Grok 4 comes with brilliant SWE benchmarks and agentic capabilities, thus it can take up many coding tasks and perform them well too. 

Grok 3 vs Grok 4

Although Grok 3 has been in the spotlight for its racist comments, with Grok 4, the team is looking to do more than just damage control. Grok 4 comes with tool use integrated from the start, and the Grok team plans to upgrade this to “commercial grade” capabilities, helping you solve actual, real-world problems. Along with this, we can expect Grok 4 to master video and image analysis and generation very soon, bringing us closer to experiencing playable AI-generated video games and fully AI-generated shows.

Conclusion

Is Grok 4 a big deal? Definitely. In a market that feels increasingly saturated, it stands out as a breath of fresh air, offering real improvements over its predecessors. With actual use cases emerging, it seems poised to help solve many everyday problems. Both standard and Heavy variants are agentic, fast, and significantly better at reasoning. While some suggest it’s built for AGI, I believe there’s still time and room for growth. Grok 3 also launched with great promise but later went off track. With this new release, it’s just the beginning, much testing is still needed to understand its true potential.

Anu Madan is an expert in instructional design, content writing, and B2B marketing, with a talent for transforming complex ideas into impactful narratives. With her focus on Generative AI, she crafts insightful, innovative content that educates, inspires, and drives meaningful engagement.

Login to continue reading and enjoy expert-curated content.

We will be happy to hear your thoughts

Leave a reply

Som2ny Network
Logo
Compare items
  • Total (0)
Compare
0