
DeepSeek-V4 Can Hold Your Entire Codebase in One Context Window and It’s Open Source


Every developer who has worked with long context models knows the feeling. You paste in your codebase, add your requirements, include some examples, and somewhere around the halfway point the model starts forgetting things it read at the top. You get generic answers. It misses files it already saw. The context window is technically full but the model is functionally half-asleep.

This is called the performance cliff, and it, not the raw token count, is the real problem with long-context AI. DeepSeek-V4 is making a specific claim here. Not just that it supports 1M tokens; several models do that now. The claim is that it stays useful across that entire window by fundamentally changing how attention works at scale. At the 1M-token setting, V4-Pro requires only 27% of the compute per token and 10% of the KV cache compared to DeepSeek-V3.2.

It is MIT licensed. Weights are on Hugging Face right now. And they shipped two models simultaneously, which means there is an actual choice to make depending on what you are building.

Two models, one decision

DeepSeek-V4 comes in two variants. Pro at 1.6T total parameters with 49B active per token, and Flash at 284B total with 13B active. Both support the full 1 million token context window.

The active parameter number is what matters for inference cost, not the total. Pro activates 49B per token, Flash activates 13B. Flash is cheaper to run and still competitive on most benchmarks. Pro is where you go when you need maximum reasoning depth on the hardest tasks.

Think of it this way. Flash is the model you run in production at volume. Pro is the model you run when the task genuinely needs the full weight of 49B active parameters thinking through it. Both use FP4 precision for MoE experts and FP8 for most other parameters, which keeps memory requirements more manageable than the raw numbers suggest.

For most developers the honest starting point is Flash. Move to Pro when Flash hits a ceiling on your specific task.
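The FP4/FP8 point is worth making concrete. A rough back-of-envelope sketch, assuming FP4 is 0.5 bytes per parameter and FP8 is 1 byte per parameter; the exact split between MoE experts (FP4) and everything else (FP8) isn't published here, so this just bounds the weight footprint from both sides:

```python
# Back-of-envelope weight-memory bounds for the two variants.
# Assumes FP4 = 0.5 bytes/param and FP8 = 1 byte/param; the real
# FP4/FP8 split is unknown, so we bound from both extremes.

def weight_memory_gb(total_params_b: float, bytes_per_param: float) -> float:
    """Convert a parameter count (in billions) to gigabytes of weights."""
    return total_params_b * bytes_per_param

for name, total in [("V4-Pro", 1600), ("V4-Flash", 284)]:
    lo = weight_memory_gb(total, 0.5)   # everything in FP4
    hi = weight_memory_gb(total, 1.0)   # everything in FP8
    print(f"{name}: {lo:.0f}-{hi:.0f} GB of weights")
```

Even at the optimistic FP4 bound, Pro sits around 800 GB of weights, which is why the "not a consumer hardware project" point later in this piece holds.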

How they made 1M tokens actually efficient

Standard attention has a quadratic problem. The more tokens you add, the more expensive every computation becomes, and the more memory you need to store the context. At 1M tokens this becomes genuinely brutal. Most models that support long context are doing it expensively and paying for it in either quality or cost.

DeepSeek-V4 uses two attention mechanisms working together. Compressed Sparse Attention handles most layers with a narrow sliding window, only attending to nearby tokens rather than the entire context. Heavily Compressed Attention handles a smaller number of layers with global reach but aggressive compression. The result is that the model maintains awareness of the full context without paying full attention cost at every layer.

They also added Manifold-Constrained Hyper-Connections on top of standard residual connections. The practical effect is more stable signal propagation across layers, which matters a lot when you are processing a million tokens and information needs to travel a long way through the network without degrading.

The Muon optimizer replaced Adam during training, which DeepSeek says delivered faster convergence and more stable training runs at this scale.

Put together, these changes explain the 27% compute and 10% KV cache numbers. The model is not just bigger, it is structurally different in how it handles length.

One model, three modes

Both V4-Pro and V4-Flash support three reasoning modes and the difference between them is significant enough to affect how you build with it.

Non-think mode gives you fast direct responses. No visible reasoning, low latency, good for routine tasks where speed matters more than depth. Think High is where the model reasons through the problem before answering, slower but meaningfully more accurate on complex tasks. Think Max pushes reasoning as far as the model can go, requires a larger context window of at least 384K tokens, and is designed for tasks where you genuinely need the ceiling.

The benchmark gap between modes is not subtle. HMMT is a prestigious math competition benchmark, the kind of problems that take human contestants hours. On HMMT 2026, V4-Flash in Non-think mode scores 40.8. The same model in Think Max scores 94.8. HLE, Humanity’s Last Exam, is one of the hardest knowledge and reasoning benchmarks available, designed to stump even expert humans. Non-think gives you 8.1 on it. Think Max gives you 34.8. Same model, same weights, completely different capability depending on how much reasoning budget you give it.

That gap is not about the model getting smarter. It is about giving it time to actually think. This matters practically because you do not need to choose a different model for different task types. You choose a mode. Routine queries run cheap on Non-think. Hard problems get Think Max when the answer actually matters.
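Routing by mode can be as simple as a small dispatch helper. The mode names below mirror the article, but the exact strings and request format are assumptions, not DeepSeek's documented API:

```python
# Hypothetical mode-routing helper. Mode names follow the article;
# the actual API field names and values are assumptions.

def pick_mode(hard: bool, latency_sensitive: bool) -> str:
    """Route a task to a reasoning mode using the trade-offs above."""
    if hard:
        return "think-max"     # maximum reasoning budget, needs >=384K context
    if latency_sensitive:
        return "non-think"     # fast, no visible reasoning
    return "think-high"        # reasons before answering, moderate latency

print(pick_mode(hard=False, latency_sensitive=True))   # routine query
print(pick_mode(hard=True, latency_sensitive=False))   # HMMT-style problem
```

The design choice is that "hard" wins over "latency-sensitive": if a task genuinely needs the reasoning ceiling, you pay the latency.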

The part most people skip

Supporting 1M tokens is easy to claim. Actually staying useful across that entire window is a different problem.

DeepSeek published two benchmarks that test this honestly. MRCR, which measures how well a model retrieves and reasons over information buried deep inside a massive context, and CorpusQA, which tests question answering across extremely long documents. Both evaluated at the full 1M token length. On MRCR, V4-Pro scores 83.5 against Gemini 3.1 Pro at 76.3. On CorpusQA it scores 62.0 against Gemini’s 53.8. Claude Opus 4.6 scores 92.9 on MRCR but drops to 71.7 on CorpusQA. These are self-reported numbers but the pattern holds. V4 maintains coherence at lengths where most models quietly degrade.

The post-training story is also worth understanding. DeepSeek didn’t train one model on everything simultaneously. They trained separate domain-specific expert models first, each one specialized in its own area, then consolidated them into a single model through on-policy distillation. The practical result is a model that performs consistently across coding, math, and reasoning rather than being strong in one domain and mediocre everywhere else. That consistency is harder to achieve than the benchmark numbers suggest and it explains why V4 doesn’t have obvious weak spots the way some frontier models do.

What the benchmarks show

Benchmark comparison, V4-Pro Max vs frontier models (via: deepseek.com):

| Benchmark | V4-Pro Max | Claude Opus 4.6 | GPT-5.4 | Gemini 3.1 Pro |
|---|---|---|---|---|
| LiveCodeBench | 93.5 | 88.8 | n/a | 91.7 |
| SWE Verified | 80.6 | 80.8 | n/a | 80.6 |
| SWE Pro | 55.4 | 57.3 | 57.7 | 54.2 |
| BrowseComp | 83.4 | 83.7 | 82.7 | 85.9 |
| GPQA Diamond | 90.1 | 91.3 | 93.0 | 94.3 |

On LiveCodeBench V4-Pro leads the group. On SWE benchmarks it sits within a point or two of Claude Opus 4.6 and Gemini 3.1 Pro. On pure reasoning benchmarks like GPQA Diamond and HLE it trails the top closed models. This is a coding and long context model first, general reasoning model second. The benchmarks reflect that accurately. One thing worth knowing is that these are self-reported numbers from DeepSeek. We've trimmed the table to the benchmarks that tell the most honest story. Full results are on the DeepSeek Hugging Face page.

How to run it

For local deployment DeepSeek recommends temperature 1.0 and top_p 1.0. Think Max mode needs at least 384K context window. The model uses a custom chat template so you will need the encoding scripts from the repo rather than a standard Jinja template. The encoding folder has Python scripts and test cases that walk you through it. SGLang is the recommended inference engine for serious deployments.
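A minimal sketch of a request using those recommended sampling settings. The endpoint, model name, and payload shape below are assumptions based on typical OpenAI-compatible APIs, not DeepSeek's confirmed schema; the code only builds the payload so you can see where the settings go:

```python
# Hypothetical request payload with the recommended sampling settings.
# Model name and payload shape are assumptions (OpenAI-compatible style);
# check DeepSeek's own docs and the repo's encoding scripts for the
# real chat template.

def build_request(prompt: str, model: str = "deepseek-v4-flash") -> dict:
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 1.0,   # recommended for local deployment
        "top_p": 1.0,         # recommended for local deployment
    }

payload = build_request("Refactor this module for readability.")
print(payload["temperature"], payload["top_p"])  # 1.0 1.0
```

Note the caveat from the paragraph above: the custom chat template means you should encode prompts with the repo's scripts rather than assume a standard Jinja template will produce the right token stream.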

Realistically V4-Pro at 1.6T parameters is not a consumer hardware project. Flash at 284B total with 13B active is more approachable but still needs serious infrastructure. For most people without that kind of hardware, the API is the honest answer right now.

The good news is the ecosystem is already moving. DeepSeek-V4 is available on OpenRouter. More providers will follow quickly, that’s been the pattern with every major DeepSeek release. Quantized versions from the community will also show up on platforms like Ollama as people work through the weights, which is typically when local deployment becomes realistic for developers without a cluster.

Until then the API route through OpenRouter or DeepSeek’s own platform is where most people will get their first real time with it.

Cloud pricing

If you're planning to use DeepSeek's own platform, the pricing is competitive for what you get.

Prices are per million tokens:

| Model | Input (cache hit) | Input (cache miss) | Output | Context |
|---|---|---|---|---|
| V4-Pro | $0.145 | $1.74 | $3.48 | 1M |
| V4-Flash | $0.028 | $0.14 | $0.28 | 1M |

Cache hits matter more here than with most models. If you are running repeated queries against the same large codebase the cached input price on V4-Pro drops from $1.74 to $0.145 per million tokens. That changes the economics significantly for long context workflows where the same files appear across many requests.
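To put numbers on that, here is a rough cost sketch for a hypothetical workflow that resends the same 800K-token codebase on every request, using the V4-Pro input prices above (per million tokens):

```python
# Rough input-cost comparison for a repeated long-context workflow.
# The 800K-token codebase and 100-request volume are illustrative
# assumptions; prices are V4-Pro's per-million-token input rates.

CACHE_HIT, CACHE_MISS = 0.145, 1.74  # $/M input tokens

def input_cost(tokens: int, requests: int, cached: bool) -> float:
    """Total input cost in dollars for `requests` calls of `tokens` each."""
    price = CACHE_HIT if cached else CACHE_MISS
    return tokens / 1e6 * price * requests

uncached = input_cost(800_000, 100, cached=False)  # every call re-reads cold
cached = input_cost(800_000, 100, cached=True)     # repeated context hits cache
print(f"${uncached:.2f} vs ${cached:.2f}")
```

Under these assumptions the cached path is roughly 12x cheaper, which is the whole argument for cache-friendly prompt layouts: keep the big static context identical across requests and put the varying part at the end.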

V4-Flash at $0.14 per million input tokens is genuinely cheap for a model performing at this level. Worth starting there before deciding if V4-Pro’s extra depth justifies the cost on your specific workload.

Who this is for

Developers working with large codebases who keep hitting context limits with current models. If your workflow involves feeding entire repositories into a model, V4 is the open source option where the architecture is specifically designed to handle that without degrading.

Researchers who need genuine long document understanding at 1M tokens. The CorpusQA results at 1M context are the most honest test of whether long context actually works, and V4-Pro holds up where most models fall apart.

Teams running agentic workflows at scale who want frontier-competitive performance without closed model pricing. MIT license means you build commercially without any conversation about terms.

And if you have been watching DeepSeek since V3 and wondering when the open source community would get a genuine 1M context model that actually works efficiently at that length, this is that release.
