back to top
HomeTechBaidu’s ERNIE 5.1 Is Rivaling Gemini 3.1 Pro at AI Search

Baidu’s ERNIE 5.1 Is Rivaling Gemini 3.1 Pro at AI Search

- Advertisement -

Baidu has been doing search longer than most AI companies have existed. While OpenAI was still a research lab and Anthropic hadn’t been founded yet, Baidu was already the dominant search engine for 1.4 billion people. Search is not something they learned recently.

So when ERNIE 5.1 lands 4th on the Search Arena global leaderboard above Gemini 3.1 Pro with Grounding, above GPT-5.4 search and even above Google’s own search-augmented model, it’s surprising only if you forgot who built it.

Most Western AI coverage treats ERNIE as a footnote. These numbers suggest that’s worth reconsidering.

4th Globally on Search. Above Google’s Gemini.

aren ai search and text rankings for earnie 5.1
via: baidu.com

Search Arena is an independent evaluation where real users compare how well different models answer questions that require searching and reasoning over current information. The kind of task where knowing a lot isn’t enough, the model has to find the right information, understand it, and produce a useful answer.

ERNIE 5.1 scores 1,223 Elo on Search Arena. Claude Opus 4.6 Search leads at 1,255. GPT-5.5 Search sits at 1,242. ERNIE is behind both but ahead of Claude Sonnet 4.6 Search, Gemini 3.1 Pro with Grounding, and every other model on that leaderboard.

For a model that most Western developers haven’t touched and most Western coverage hasn’t covered seriously, that placement deserves more than a footnote.

The text Arena tells a similar thing. ERNIE 5.1 Preview sits 13th globally at 1,476 Elo between GPT-5.5 and Grok-4.20 Multi Agent. These are global evaluations with actual competition from the models everyone talks about.

The benchmark numbers

Baidu Ernie 5.1 Benchmarks
via: baidu.com

AIME26 with tools at 99.6 puts ERNIE 5.1 second only to Gemini 3.1 Pro at 99.9 on competition mathematics with tool use. DeepSeek V4 Pro scores 92.6 on the same benchmark. Claude Opus 4.6 scores 81.2. The gap between ERNIE and the next closest model on that specific benchmark is not small.

τ3-bench, which tests multi-step agent execution across realistic scenarios, scores 67.9 narrowly ahead of DeepSeek V4 Pro at 67.5 and Gemini 3.1 Pro at 67.1. Claude Opus 4.6 leads that chart at 72.4.

SpreadsheetBench tests whether a model can handle real spreadsheet reasoning tasks — the kind of structured data work that comes up constantly in actual business workflows. ERNIE scores 72.5, ahead of DeepSeek V4 Pro at 67.0 and Gemini 3.1 Pro’s result.

But before reading too much into any of this. Baidu published these benchmarks. Most of the comparison numbers come from their own evaluation setup. The Search Arena and text Arena results are independently verified which is exactly why those two numbers are the ones worth leading with.

The technical bet that made this possible

Most model families are built in a different way. You decide you want a 7B model, a 13B model, and a 70B model, then you run three separate training jobs. 3x the compute, 3x the time, three results that aren’t guaranteed to be consistent with each other.

Baidu did something different with ERNIE 5.1. They trained one large model ERNIE 5.0 in a way that contains multiple model sizes within it simultaneously. During training the system randomly varies how many layers are active, how many experts participate in routing, and how many experts activate per token. The result is a single training run that produces an entire family of models at different sizes and compute budgets.

ERNIE 5.1 is extracted from that matrix rather than trained from scratch. Total parameters compressed to roughly one third of ERNIE 5.0. Active parameters at roughly half. Pre-training compute at 6% of what a comparable model trained from scratch would cost. It means Baidu can iterate faster than labs spending full compute on every model. The efficiency compounds into more experiments, faster feedback, and quicker improvements.

Whether that architectural advantage shows up in the benchmarks is exactly what the Search Arena result suggests it does.

Related: ERNIE-Image: Open-Source 8B Text-to-Image Model for Posters, Comics & Structured Generation

The four stage post training pipeline

One of the quiet problems in training large models is what researchers call the seesaw effect. Make the model better at coding and it gets slightly worse at creative writing. Improve math reasoning and instruction following regresses. Every capability gain trades against something else.

Baidu’s approach to ERNIE 5.1 post training attacks this problem directly. Instead of training everything together in one stage, they separate expert training from capability fusion across four sequential stages.

First a unified fine tuning stage establishes baseline instruction following. Then separate expert models for coding, reasoning, and agentic tasks are trained in parallel each one optimized independently without interference from the others. Then a distillation stage pulls all those expert capabilities into one unified model simultaneously. Finally a general reinforcement learning stage handles the capabilities that don’t distill well, creative writing, open ended conversation, tasks with high output diversity.

The result is a model that doesn’t have to compromise between capabilities because those capabilities were never in competition during training. Whether that holds up in practice is what the creative writing claims and the benchmark spread suggest, strong across multiple categories without obvious regressions.

You May Like: Best AI Coding Models for Consumer Hardware

Limitations

ERNIE 5.1 is not open weights. There is no Hugging Face release, so there is no way to run it locally. Access is through Baidu’s own products including the ERNIE website, Baidu AI Studio, and the creative platforms rolling out ERNIE 5.1 integration.

The benchmark situation is also worth being clear about. The Search Arena and text Arena results are independently verified and those are the numbers that matter most for assessing where this model actually stands. The agentic and knowledge benchmarks come from Baidu’s own evaluation setup. Plausible given the Arena results but not independently confirmed.

Creative writing claims are the hardest to evaluate. Approaching Gemini 3.1 Pro on internal creative writing evaluations is a specific claim that requires actually using the model across many creative tasks to verify. Internal evaluations from the model’s own developer are the least reliable signal in the entire release.

Who should care

If you build search products, document retrieval systems, or anything where finding and synthesizing current information matters, ERNIE 5.1’s Search Arena placement is worth paying attention to. 4th globally on an independent leaderboard above Google’s own search model is a signal that deserves more than the coverage it’s getting.

If you work with Chinese language content or serve Chinese-speaking users, this is the most capable model available for that use case by a significant margin.

If you need open weights, local deployment, or hardware you control, this isn’t the model for that. Baidu keeps ERNIE behind their own products and that’s unlikely to change.

The broader point is simpler. Western AI coverage has a blind spot around models that aren’t from OpenAI, Anthropic, Google, or the handful of Chinese labs that release open weights. ERNIE 5.1 sits 13th globally on text Arena and 4th on Search Arena. That’s not a regional result. That’s a global one and it deserves to be treated that way.

Don’t miss any Tech Story

Subscribe To Firethering NewsLetter

You Can Unsubscribe Anytime! Read more in our privacy policy

LEAVE A REPLY

Please enter your comment!
Please enter your name here

YOU MAY ALSO LIKE
glm 5.2 ai open weights

GLM-5.2 Is the Closest an Open Model Has Come to Claude

0
What does it take for an open-weight model to stop chasing Claude and actually beat it? Every open-weight release for two years has told some version of the same story: closer, but not quite. The chart shrinks, the wording softens to "competitive with," and the conversation moves on until the next model repeats the cycle. GLM-5.2 breaks that pattern. The model is built to survive long, messy coding work, the kind that runs for hours without losing the thread. That's the pitch its maker is leading with. But scroll down their own benchmark table and something else is sitting there quietly: on a couple of standard math evals, this open model isn't approaching Claude Opus 4.8, GPT-5.5, or Gemini 3.1 Pro. It's beating all three, on the same table. It loses plenty of ground elsewhere, and that part matters just as much as the wins. But a model anyone can download under an MIT license, with no usage restrictions attached, coming out ahead of the lab everyone else measures themselves against, is worth pausing on before getting to what the rest of the numbers actually say.
Open-Source AI Tools Worth Trying Right Now

5 Open-Source AI Tools You Probably Haven’t Tried Yet

0
Every week brings another open source AI release, and most of them require setting up a Python environment. Find out the model card lied about VRAM requirements. By the time something actually runs, the appeal has mostly worn off. The five tools below skip most of that. One turns image and video generation into something closer to a desktop app. One gives DeepSeek an actual workspace instead of a browser tab. One builds UI prototypes using coding agents you probably already have installed. One quietly builds a memory system out of your own apps. And one is, literally, a desktop pet.
Claude Mythos 5 and Claude Fable 5

Claude Mythos 5 Was Too Powerful to Ship. Anthropic Released Fable 5 Instead.

0
Anthropic gave stripe early access to Fable 5 and set it loose on a 50 million line Ruby codebase. The migration that would have taken a full engineering team over two months got done in a day. That's a real company's real codebase and a task with real consequences if it goes wrong. Anthropic leads with it because it's the kind of result that's hard to argue with & because it sets up everything else they need to tell you about why this launch looks the way it does. Because here's the thing. The model Anthropic actually built Claude Mythos 5, isn't what most people are getting today. What's going live for general use is Claude Fable 5. Same underlying model. Different version. The parts Anthropic decided were too dangerous for public release got a separate wrapper, a separate name, and a separate approval process controlled in part by the US government.