Baidu’s ERNIE 5.1 Is Rivaling Gemini 3.1 Pro at AI Search

Baidu has been doing search longer than most AI companies have existed. While OpenAI was still a research lab and Anthropic hadn’t been founded yet, Baidu was already the dominant search engine for 1.4 billion people. Search is not something they learned recently.

So when ERNIE 5.1 lands 4th on the Search Arena global leaderboard, above GPT-5.4 Search and above Gemini 3.1 Pro with Grounding, Google’s own search-augmented model, it’s surprising only if you forgot who built it.

Most Western AI coverage treats ERNIE as a footnote. These numbers suggest that’s worth reconsidering.

4th Globally on Search. Above Google’s Gemini.

Search Arena and Text Arena rankings for ERNIE 5.1
via: baidu.com

Search Arena is an independent evaluation where real users compare how well different models answer questions that require searching and reasoning over current information. It’s the kind of task where knowing a lot isn’t enough: the model has to find the right information, understand it, and produce a useful answer.

ERNIE 5.1 scores 1,223 Elo on Search Arena. Claude Opus 4.6 Search leads at 1,255. GPT-5.5 Search sits at 1,242. ERNIE is behind both but ahead of Claude Sonnet 4.6 Search, Gemini 3.1 Pro with Grounding, and every other model on that leaderboard.
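
A note on what those Elo numbers mean: arena-style leaderboards are built from pairwise votes, usually scored with a chess-style Elo or Bradley-Terry rating. The leaderboard’s exact method isn’t restated here, but a minimal sketch of a standard Elo update, assuming a K-factor of 32, looks like this:

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def elo_update(rating_a: float, rating_b: float, a_won: bool,
               k: float = 32.0) -> tuple[float, float]:
    """Return updated (rating_a, rating_b) after one pairwise vote."""
    exp_a = expected_score(rating_a, rating_b)
    score_a = 1.0 if a_won else 0.0
    return (rating_a + k * (score_a - exp_a),
            rating_b + k * ((1.0 - score_a) - (1.0 - exp_a)))

# A 32-point gap (ERNIE at 1,223 vs. Claude Opus at 1,255) implies the
# lower-rated model still wins roughly 45% of head-to-head votes.
print(expected_score(1223, 1255))  # ~0.454
```

In other words, the top of this leaderboard is tightly packed: those rating gaps translate to near-coin-flip head-to-head outcomes.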

For a model that most Western developers haven’t touched and most Western outlets haven’t covered seriously, that placement deserves more than a footnote.

The Text Arena tells a similar story. ERNIE 5.1 Preview sits 13th globally at 1,476 Elo, between GPT-5.5 and Grok-4.20 Multi Agent. These are global evaluations with real competition from the models everyone talks about.

The benchmark numbers

Baidu ERNIE 5.1 benchmarks
via: baidu.com

On AIME26 with tools, ERNIE 5.1 scores 99.6, second only to Gemini 3.1 Pro at 99.9 on competition mathematics with tool use. DeepSeek V4 Pro scores 92.6 on the same benchmark. Claude Opus 4.6 scores 81.2. The gap between ERNIE and the next model below it is not small.

On τ3-bench, which tests multi-step agent execution across realistic scenarios, ERNIE 5.1 scores 67.9, narrowly ahead of DeepSeek V4 Pro at 67.5 and Gemini 3.1 Pro at 67.1. Claude Opus 4.6 leads that chart at 72.4.

SpreadsheetBench tests whether a model can handle real spreadsheet reasoning tasks, the kind of structured data work that comes up constantly in actual business workflows. ERNIE scores 72.5, ahead of DeepSeek V4 Pro at 67.0 and ahead of Gemini 3.1 Pro’s result.

But before reading too much into any of this, remember who ran the tests. Baidu published these benchmarks, and most of the comparison numbers come from their own evaluation setup. The Search Arena and Text Arena results are independently verified, which is exactly why those two numbers are the ones worth leading with.

The technical bet that made this possible

Most model families are built the obvious way. You decide you want a 7B model, a 13B model, and a 70B model, then you run three separate training jobs: 3x the compute, 3x the time, and three results that aren’t guaranteed to be consistent with each other.

Baidu did something different with ERNIE 5.1. They trained one large model, ERNIE 5.0, in a way that contains multiple model sizes within it simultaneously. During training, the system randomly varies how many layers are active, how many experts participate in routing, and how many experts activate per token. The result is a single training run that produces an entire family of models at different sizes and compute budgets.
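
Baidu hasn’t published the training code, but the description resembles elastic or matryoshka-style supernet training. Here’s a conceptual sketch in PyTorch; every class name, dimension, and routing choice below is a simplification for illustration, not Baidu’s actual design:

```python
import random
import torch
import torch.nn as nn

class ElasticMoELayer(nn.Module):
    """One layer with a pool of experts; a random subset is active per step."""
    def __init__(self, d_model: int, n_experts: int):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Linear(d_model, d_model) for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor, n_active: int) -> torch.Tensor:
        active = random.sample(list(self.experts), n_active)
        # Average the active experts' outputs (real MoE routing is learned).
        return torch.stack([expert(x) for expert in active]).mean(dim=0)

class ElasticModel(nn.Module):
    """A stack of layers whose depth and expert budget vary per step,
    so one training run covers many effective model sizes at once."""
    def __init__(self, d_model: int = 64, n_layers: int = 8, n_experts: int = 16):
        super().__init__()
        self.layers = nn.ModuleList(
            ElasticMoELayer(d_model, n_experts) for _ in range(n_layers)
        )

    def forward(self, x, n_layers_active: int, n_experts_active: int):
        for layer in self.layers[:n_layers_active]:
            x = x + layer(x, n_experts_active)  # residual connection
        return x

model = ElasticModel()
x = torch.randn(2, 10, 64)
# Each training step samples a different sub-network.
out = model(x, n_layers_active=random.randint(4, 8),
            n_experts_active=random.choice([2, 4, 8]))
```

The key property is that any slice of the network, fewer layers or fewer experts, has already seen training signal, so it can function as a standalone model.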

ERNIE 5.1 is extracted from that matrix rather than trained from scratch. Total parameters compressed to roughly one third of ERNIE 5.0. Active parameters at roughly half. Pre-training compute at 6% of what a comparable model trained from scratch would cost. That means Baidu can iterate faster than labs spending full compute on every model: the efficiency compounds into more experiments, faster feedback, and quicker improvements.
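
Under that framing, extraction amounts to fixing one sub-network and keeping only its weights. Continuing the hypothetical ElasticModel sketch from above:

```python
def extract_submodel(supernet: ElasticModel,
                     n_layers: int, n_experts: int) -> ElasticModel:
    """Carve a fixed-size model out of the elastically trained supernet,
    keeping the first n_layers layers and n_experts experts per layer."""
    sub = ElasticModel(n_layers=n_layers, n_experts=n_experts)
    for sub_layer, big_layer in zip(sub.layers, supernet.layers):
        for sub_expert, big_expert in zip(sub_layer.experts, big_layer.experts):
            sub_expert.load_state_dict(big_expert.state_dict())
    return sub

# Loosely the ERNIE 5.1 story: a much smaller model whose weights are
# inherited from the big run rather than trained from scratch.
small = extract_submodel(model, n_layers=4, n_experts=6)
```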

Whether that architectural advantage survives contact with real evaluation is the open question, and the Search Arena result suggests it does.

Related: ERNIE-Image: Open-Source 8B Text-to-Image Model for Posters, Comics & Structured Generation

The four-stage post-training pipeline

One of the quiet problems in training large models is what researchers call the seesaw effect. Make the model better at coding and it gets slightly worse at creative writing. Improve math reasoning and instruction following regresses. Every capability gain trades against something else.

Baidu’s approach to ERNIE 5.1 post-training attacks this problem directly. Instead of training everything together in one stage, they separate expert training from capability fusion across four sequential stages.

First, a unified fine-tuning stage establishes baseline instruction following. Then separate expert models for coding, reasoning, and agentic tasks are trained in parallel, each one optimized independently without interference from the others. Next, a distillation stage pulls all those expert capabilities into one unified model simultaneously. Finally, a general reinforcement learning stage handles the capabilities that don’t distill well: creative writing, open-ended conversation, tasks with high output diversity.
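
None of this pipeline is public, but the orchestration it describes is easy to sketch. In the hypothetical Python below, every function is a stub standing in for a full training job, not a real Baidu API:

```python
# Hypothetical sketch of the four-stage post-training flow described above.
# The "model" is just a list of stage tags so the sequencing is visible.

def supervised_finetune(model, dataset):
    return model + [f"sft:{dataset}"]

def distill(student, teachers):
    return student + [f"distill:{len(teachers)}-teachers"]

def reinforcement_learning(model, dataset):
    return model + [f"rl:{dataset}"]

def post_train(base_model):
    # Stage 1: unified fine-tuning for baseline instruction following.
    model = supervised_finetune(base_model, "instructions")

    # Stage 2: specialist experts, each trained from the same checkpoint
    # so gains in one domain cannot regress another (no seesaw).
    experts = [supervised_finetune(model, domain)
               for domain in ("coding", "reasoning", "agentic")]

    # Stage 3: one distillation pass pulls every expert into one student.
    model = distill(student=model, teachers=experts)

    # Stage 4: general RL for what distillation misses, e.g. creative
    # writing and open-ended conversation.
    return reinforcement_learning(model, "preferences")

print(post_train(["base"]))
```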

The result is a model that doesn’t have to compromise between capabilities, because those capabilities were never in competition during training. Whether that holds up in practice is harder to verify, but the creative writing claims and the benchmark spread both point the same way: strong across multiple categories without obvious regressions.

You May Like: Best AI Coding Models for Consumer Hardware

Limitations

ERNIE 5.1 is not open weights. There is no Hugging Face release, so there is no way to run it locally. Access is through Baidu’s own products, including the ERNIE website, Baidu AI Studio, and the creative platforms rolling out ERNIE 5.1 integration.

The benchmark situation is also worth being clear about. The Search Arena and Text Arena results are independently verified, and those are the numbers that matter most for assessing where this model actually stands. The agentic and knowledge benchmarks come from Baidu’s own evaluation setup: plausible given the Arena results, but not independently confirmed.

Creative writing claims are the hardest to evaluate. Approaching Gemini 3.1 Pro on internal creative writing evaluations is a specific claim that requires actually using the model across many creative tasks to verify. Internal evaluations from the model’s own developer are the least reliable signal in the entire release.

Who should care

If you build search products, document retrieval systems, or anything where finding and synthesizing current information matters, ERNIE 5.1’s Search Arena placement is worth paying attention to. 4th globally on an independent leaderboard, above Google’s own search model, is a signal that deserves more than the coverage it’s getting.

If you work with Chinese language content or serve Chinese-speaking users, this is the most capable model available for that use case by a significant margin.

If you need open weights, local deployment, or hardware you control, this isn’t the model for that. Baidu keeps ERNIE behind their own products and that’s unlikely to change.

The broader point is simpler. Western AI coverage has a blind spot around models that aren’t from OpenAI, Anthropic, Google, or the handful of Chinese labs that release open weights. ERNIE 5.1 sits 13th globally on Text Arena and 4th on Search Arena. That’s not a regional result. That’s a global one, and it deserves to be treated that way.
