GPT-5.4 Is Outperforming Humans at Work. But the Real Story Is What OpenAI Isn’t Telling You

OpenAI dropped their latest model yesterday and buried inside the benchmarks is a number that deserves more attention than it’s getting. On GDPval, a test that puts AI agents through real professional tasks across 44 actual occupations, GPT-5.4 matched or outperformed human professionals 83% of the time. The previous version sat at 71%. That’s not a small jump.

And this isn’t GPT writing emails or summarizing documents anymore. This version can move a mouse, click buttons, fill out forms, and work across applications the way a person sitting at a desk would. It scored 75% on OSWorld, a benchmark that tests exactly that. The average office worker scores 72.4%.

On that benchmark, the model already edges past the average person who uses a computer for a living, and 83% is just the beginning of what this release actually means.

The GDPval Number Nobody Is Talking About

The tasks GPT-5.4 was tested on are things real people get hired to do: sales presentations, accounting spreadsheets, urgent care schedules, manufacturing diagrams. This is the kind of output a junior hire would spend their first few months learning to produce.

The finance number is the one that stopped me. On investment banking modeling tasks, the Excel-heavy work that junior analysts spend most of their first two years doing, GPT-5.4 scored 87.3%. GPT-5.2 was at 68.4%. That's nearly 19 points in a single release.

To be fair, GDPval tests specific tasks, not entire careers. A job is more than its deliverables. But when the deliverables are exactly what junior roles are hired for, that distinction starts to feel thinner than it used to.

GPT-5.4 Is Not Just Answering Questions Anymore

Think about what a junior analyst actually does on a given day. They open a PDF, pull numbers from it, drop them into a spreadsheet, build a model, then paste results into a presentation. That’s not one task. That’s four applications, a lot of switching, and hours of work.

GPT-5.4 can now do that sequence without stopping. Not by generating text about it. By actually doing it across the applications, the same way a person would.

On an internal benchmark of spreadsheet modeling tasks, specifically the kind a junior investment banking analyst would handle, it scored 87.3%. On presentations, human raters preferred GPT-5.4’s output over GPT-5.2’s 68% of the time. The quality gap between versions is noticeable enough that people can see it without being told which is which.

For developers building on top of this, GPT-5.4 also supports up to 1 million tokens of context. That means an agent can hold an entire project in memory, plan across it, execute steps, check its own work, and keep going without losing track of where it started.

That’s a different kind of tool than what most people picture when they think of ChatGPT.

The Parts OpenAI Won’t Tell You About GPT-5.4

The 83% number is real. But there are three things buried in this release that quietly put a ceiling on how far that number actually reaches in the real world.

The 1M Context Trap

GPT-5.4 technically supports a 1 million token context window. What OpenAI didn’t put in the headline is that anything beyond 272K tokens gets charged at 2x the normal rate. That’s not a feature, that’s a tax. If your workflow genuinely needs that full window, you’re paying double for the privilege. Treat 272K as the real limit and build around it.
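To see what the surcharge does to a bill, here is a minimal sketch. It assumes by default that only the input tokens past 272K are billed at double the rate; the multiplier might instead apply to the whole request once the threshold is crossed, which the `whole_request` flag models. The $1.00 per million base rate is a placeholder for illustration, not a published price.

```python
LONG_CONTEXT_THRESHOLD = 272_000  # tokens, as stated in the article
SURCHARGE_MULTIPLIER = 2.0        # "2x the normal rate", as stated in the article


def input_cost(tokens: int, base_rate_per_million: float,
               whole_request: bool = False) -> float:
    """Estimate the input cost in dollars for a single request.

    whole_request=False: only tokens past the threshold cost 2x (assumption).
    whole_request=True: crossing the threshold doubles the whole request (assumption).
    """
    rate = base_rate_per_million / 1_000_000
    if tokens <= LONG_CONTEXT_THRESHOLD:
        return tokens * rate
    if whole_request:
        return tokens * rate * SURCHARGE_MULTIPLIER
    extra = tokens - LONG_CONTEXT_THRESHOLD
    return LONG_CONTEXT_THRESHOLD * rate + extra * rate * SURCHARGE_MULTIPLIER


# Placeholder base rate of $1.00 per million input tokens, purely illustrative.
for n in (200_000, 272_000, 1_000_000):
    marginal = input_cost(n, 1.0)
    whole = input_cost(n, 1.0, whole_request=True)
    print(f"{n:>9} tokens: marginal ${marginal:.3f}, whole-request ${whole:.3f}")
```

Under either reading, filling the full window costs well over three times what a request at the 272K threshold costs, which is why budgeting around 272K is the safer default.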

The Cost Problem Nobody Is Doing the Math On

To get that 83% human-level performance, you need GPT-5.4 Pro. That runs $30 per million input tokens and $180 per million output tokens. At that price point, for high-volume repetitive work like data entry or customer support, the math doesn’t always favor the AI. A junior hire handling straightforward volume tasks can still be cheaper than running Pro at scale. The ROI just isn’t there yet for every use case.
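The comparison can be made concrete with a rough sketch. The Pro prices are the ones quoted above; the token counts per task, the hourly cost, and the tasks-per-hour figure are made-up placeholders that show the shape of the calculation, not measurements.

```python
PRO_INPUT_PER_MILLION = 30.0    # dollars, as quoted in the article
PRO_OUTPUT_PER_MILLION = 180.0  # dollars, as quoted in the article


def pro_cost_per_task(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one task at the quoted GPT-5.4 Pro prices."""
    return (input_tokens * PRO_INPUT_PER_MILLION
            + output_tokens * PRO_OUTPUT_PER_MILLION) / 1_000_000


def human_cost_per_task(hourly_cost: float, tasks_per_hour: float) -> float:
    """Dollar cost of one task for a person costing hourly_cost per hour."""
    return hourly_cost / tasks_per_hour


# Hypothetical workload: 50K tokens in and 10K tokens out per task.
ai = pro_cost_per_task(50_000, 10_000)
# Hypothetical junior hire: $30/hour fully loaded, 20 simple tasks per hour.
human = human_cost_per_task(30.0, 20)
print(f"Pro: ${ai:.2f} per task, human: ${human:.2f} per task")
```

With these placeholder inputs the script prints `Pro: $3.30 per task, human: $1.50 per task`. The result flips as soon as tasks get longer for a person or shorter in tokens, so it is worth plugging in your own numbers before deciding either way.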

The Security Ceiling

OpenAI’s own safety documentation flags GPT-5.4 as having high cyber capability and wraps it in significant restrictions around anything that looks like offensive security work. The model won’t think creatively outside those guardrails. For white-hat hackers and security researchers, the kind of outside-the-box thinking that makes someone genuinely good at that work is exactly what the model is prevented from doing.

The 83% Trade-Off: Power vs. Privacy

GPT-5.4 might be the most capable model available right now. But it arrives at a complicated moment for OpenAI.

On February 28th, OpenAI signed a deal with the Pentagon to deploy AI on classified military networks. The same day, ChatGPT uninstalls in the US jumped 295%, according to Sensor Tower. One-star reviews surged 775%. Claude hit number one on the US App Store for the first time, with downloads up 51% day over day.

People voted with their phones.

The contrast is hard to ignore. Anthropic got blacklisted as a national security risk for refusing to allow mass domestic surveillance and autonomous weapons without human oversight. OpenAI signed a deal. And now GPT-5.4, a model with native computer use capabilities and access to classified networks, is the most powerful version yet.

For professionals in 2026, the question isn’t just “does GPT-5.4 perform better.” It’s “where does my data go and what is it being used for.”

If that question matters to your work, alternatives exist. Claude is one. Local models like GLM-5 that run entirely on your own machine are another. The performance gap is closing faster than most people expected.

The 83% result is real. So is the trade-off that comes with it.

The Bottom Line on GPT-5.4

GPT-5.4 is genuinely impressive. The benchmarks are real, the computer use capabilities are real, and the jump from GPT-5.2 is significant enough that it’s hard to dismiss.

But impressive and right for everyone are two different things. The pricing ceiling and the data privacy question deserve a place in your decision-making alongside the 83% headline.

Use it if it fits your workflow. If it doesn’t, the alternatives are better than they’ve ever been.
