back to top
HomeTechDeepSeek-V4 Can Hold Your Entire Codebase in One Context Window and It's...

DeepSeek-V4 Can Hold Your Entire Codebase in One Context Window and It’s Open Source

- Advertisement -

Every developer who has worked with long context models knows the feeling. You paste in your codebase, add your requirements, include some examples, and somewhere around the halfway point the model starts forgetting things it read at the top. You get generic answers. It misses files it already saw. The context window is technically full but the model is functionally half-asleep.

This is called the performance cliff and it is the real problem with long context AI, not the number itself. DeepSeek-V4 is making a specific claim here. Not just that it supports 1M tokens, several models do that now. The claim is that it stays useful across that entire window by fundamentally changing how attention works at scale. In the 1M token setting, V4-Pro requires only 27% of the compute per token and 10% of the KV cache compared to DeepSeek-V3.2.

It is MIT licensed. Weights are on HuggingFace right now. And they shipped two models simultaneously, which means there is an actual choice to make depending on what you are building.

Two models, one decision

DeepSeek-V4 comes in two variants. Pro at 1.6T total parameters with 49B active per token, and Flash at 284B total with 13B active. Both support the full 1 million token context window.

The active parameter number is what matters for inference cost, not the total. Pro activates 49B per token, Flash activates 13B. Flash is cheaper to run and still competitive on most benchmarks. Pro is where you go when you need maximum reasoning depth on the hardest tasks.

Think of it this way. Flash is the model you run in production at volume. Pro is the model you run when the task genuinely needs the full weight of 49B active parameters thinking through it. Both use FP4 precision for MoE experts and FP8 for most other parameters, which keeps memory requirements more manageable than the raw numbers suggest.

For most developers the honest starting point is Flash. Move to Pro when Flash hits a ceiling on your specific task.

How they made 1M tokens actually efficient

Standard attention has a quadratic problem. The more tokens you add, the more expensive every computation becomes, and the more memory you need to store the context. At 1M tokens this becomes genuinely brutal. Most models that support long context are doing it expensively and paying for it in either quality or cost.

DeepSeek-V4 uses two attention mechanisms working together. Compressed Sparse Attention handles most layers with a narrow sliding window, only attending to nearby tokens rather than the entire context. Heavily Compressed Attention handles a smaller number of layers with global reach but aggressive compression. The result is that the model maintains awareness of the full context without paying full attention cost at every layer.

They also added Manifold-Constrained Hyper-Connections on top of standard residual connections. The practical effect is more stable signal propagation across layers, which matters a lot when you are processing a million tokens and information needs to travel a long way through the network without degrading.

The Muon optimizer replaced Adam during training, which DeepSeek says delivered faster convergence and more stable training runs at this scale.

Put together these changes explain the 27% compute and 10% KV cache numbers. The model is not just bigger, it is structurally different in how it handles length.

One Model with Three Modes

Both V4-Pro and V4-Flash support three reasoning modes and the difference between them is significant enough to affect how you build with it.

Non-think mode gives you fast direct responses. No visible reasoning, low latency, good for routine tasks where speed matters more than depth. Think High is where the model reasons through the problem before answering, slower but meaningfully more accurate on complex tasks. Think Max pushes reasoning as far as the model can go, requires a larger context window of at least 384K tokens, and is designed for tasks where you genuinely need the ceiling.

The benchmark gap between modes is not subtle. HMMT is a prestigious math competition benchmark, the kind of problems that take human contestants hours. On HMMT 2026, V4-Flash in Non-think mode scores 40.8. The same model in Think Max scores 94.8. HLE, Humanity’s Last Exam, is one of the hardest knowledge and reasoning benchmarks available, designed to stump even expert humans. Non-think gives you 8.1 on it. Think Max gives you 34.8. Same model, same weights, completely different capability depending on how much reasoning budget you give it.

That gap is not about the model getting smarter. It is about giving it time to actually think. This matters practically because you do not need to choose a different model for different task types. You choose a mode. Routine queries run cheap on Non-think. Hard problems get Think Max when the answer actually matters.

The part most people skip

Supporting 1M tokens is easy to claim. Actually staying useful across that entire window is a different problem.

DeepSeek published two benchmarks that test this honestly. MRCR, which measures how well a model retrieves and reasons over information buried deep inside a massive context, and CorpusQA, which tests question answering across extremely long documents. Both evaluated at the full 1M token length. On MRCR, V4-Pro scores 83.5 against Gemini 3.1 Pro at 76.3. On CorpusQA it scores 62.0 against Gemini’s 53.8. Claude Opus 4.6 scores 92.9 on MRCR but drops to 71.7 on CorpusQA. These are self-reported numbers but the pattern holds. V4 maintains coherence at lengths where most models quietly degrade.

The post-training story is also worth understanding. DeepSeek didn’t train one model on everything simultaneously. They trained separate domain-specific expert models first, each one specialized in its own area, then consolidated them into a single model through on-policy distillation. The practical result is a model that performs consistently across coding, math, and reasoning rather than being strong in one domain and mediocre everywhere else. That consistency is harder to achieve than the benchmark numbers suggest and it explains why V4 doesn’t have obvious weak spots the way some frontier models do.

What the benchmarks show

Deepseek v4 pro benchmarks
Deepseek v4 model benchmark compare
via: deepseek.com
BenchmarkV4-Pro MaxClaude Opus 4.6GPT-5.4Gemini 3.1 Pro
LiveCodeBench93.588.8n/a91.7
SWE Verified80.680.8n/a80.6
SWE Pro55.457.357.754.2
BrowseComp83.483.782.785.9
GPQA Diamond90.191.393.094.3

On LiveCodeBench V4-Pro leads the group. On SWE benchmarks it sits within a point or two of Claude Opus 4.6 and Gemini 3.1 Pro. On pure reasoning benchmarks like GPQA Diamond and HLE it trails the top closed models. This is a coding and long context model first, general reasoning model second. The benchmarks reflect that accurately. One thing worth knowing is that these are self-reported numbers from DeepSeek. We’ve trimmed the table to the benchmarks that tell the most honest story. Full results are on the DeepSeek HuggingFace page.

How to run it

For local deployment DeepSeek recommends temperature 1.0 and top_p 1.0. Think Max mode needs at least 384K context window. The model uses a custom chat template so you will need the encoding scripts from the repo rather than a standard Jinja template. The encoding folder has Python scripts and test cases that walk you through it. SGLang is the recommended inference engine for serious deployments.

Realistically V4-Pro at 1.6T parameters is not a consumer hardware project. Flash at 284B with 13B active is more approachable but still needs serious infrastructure. If you want to run either locally right now, the API is the honest answer for most people.

The good news is the ecosystem is already moving. DeepSeek-V4 is available on OpenRouter. More providers will follow quickly, that’s been the pattern with every major DeepSeek release. Quantized versions from the community will also show up on platforms like Ollama as people work through the weights, which is typically when local deployment becomes realistic for developers without a cluster.

Until then the API route through OpenRouter or DeepSeek’s own platform is where most people will get their first real time with it.

Cloud pricing

If you’re thinking to use DeepSeek’s own platform then the pricing is competitive for what you get.

ModelInput (cache hit)Input (cache miss)OutputContext
V4-Pro$0.145$1.74$3.481M
V4-Flash$0.028$0.14$0.281M

Cache hits matter more here than with most models. If you are running repeated queries against the same large codebase the cached input price on V4-Pro drops from $1.74 to $0.145 per million tokens. That changes the economics significantly for long context workflows where the same files appear across many requests.

V4-Flash at $0.14 per million input tokens is genuinely cheap for a model performing at this level. Worth starting there before deciding if V4-Pro’s extra depth justifies the cost on your specific workload.

Who this is for

Developers working with large codebases who keep hitting context limits with current models. If your workflow involves feeding entire repositories into a model, V4 is the open source option where the architecture is specifically designed to handle that without degrading.

Researchers who need genuine long document understanding at 1M tokens. The CorpusQA results at 1M context are the most honest test of whether long context actually works, and V4-Pro holds up where most models fall apart.

Teams running agentic workflows at scale who want frontier-competitive performance without closed model pricing. MIT license means you build commercially without any conversation about terms.

And if you have been watching DeepSeek since V3 and wondering when the open source community would get a genuine 1M context model that actually works efficiently at that length, this is that release.

Don’t miss any Tech Story

Subscribe To Firethering NewsLetter

You Can Unsubscribe Anytime! Read more in our privacy policy

LEAVE A REPLY

Please enter your comment!
Please enter your name here

YOU MAY ALSO LIKE
Amazon Added AI Merch to Its Shopping App

Amazon Just Made Print-on-Demand a Default Shopping Feature. The Platforms Built Around It Should...

0
Amazon didn't hold a press event for this. Just a quiet update to the Shopping app, tap the Alexa icon, describe what you want on a T-shirt, watch it appear. Add to cart. Prime shipping handles the rest. That's it. That's the whole barrier now. For years, turning an idea into a physical product meant either learning design tools, hiring someone who had, or finding a platform that made it slightly less painful. Print-on-demand services like Redbubble and Fourthwall built real businesses around that problem. Amazon just solved that problem too.
ideogram 4.0 ai model

Ideogram 4 Topped the Open-Weight Leaderboard. Then We Read the License.

0
Ideogram was founded by former Google Brain researchers who worked on Imagen, Google's own text-to-image system. When that team releases an open-weight model, you pay attention. Ideogram 4 tops the open-weight design leaderboard by a margin that isn't close. Professional designers picked it first in blind typography tests nearly half the time. At 9.3B parameters it beats open models three times its size on text rendering. Then we read the license.
Google Built Gemma 4 12B Without Multimodal Encoders

Google Built Gemma 4 12B Without Multimodal Encoders

0
Every multimodal model you've used has the same basic system. Text goes in one way, images go through a vision encoder first, audio goes through an audio encoder first, and then everything gets handed off to the language model in a form it can work with. The encoders are load-bearing and you don't just remove them.Google actually removed them.Gemma 4 12B takes raw image patches and raw audio waveforms and projects them directly into the same embedding space as text tokens. There is no vision encoder or audio encoder. One decoder handling everything.