The open source model space has genuinely caught up. There are models today that rival GPT-5 and Claude Opus level performance, and you can download their weights for free. The problem is running them: a 70B model at full precision wants an A100.
Most developers aren’t working with that. They’re on an M2 MacBook Pro, an RTX 4060, maybe a gaming PC with 16GB of VRAM. That’s exactly the hardware gap these five models are trying to close: all open source, capable enough to handle real coding work, and runnable on mid-range consumer hardware.
1. Gemma 4 E4B-IT

Google DeepMind doesn’t usually get mentioned in the same breath as the open source releases coming out of Chinese labs and independent research teams. Gemma 4 E4B-IT might change that.
The E4B has 4.5 billion effective parameters. The “E” stands for effective: Google uses a technique called Per-Layer Embeddings that pushes the total parameter count to 8B while keeping the actual compute closer to a 4B model. Practically, that means you get a model that performs well beyond what 4.5B parameters would suggest.
It’s multimodal out of the box. Text, images, and audio all handled natively — which puts it in rare company at this size. The context window sits at 128K tokens, enough to load a meaningful chunk of a codebase into a single prompt.
On coding specifically, it’s honest to say this isn’t the strongest coder on this list. A Codeforces ELO of 940 and LiveCodeBench v6 at 52% say that plainly. Where it earns its spot is breadth: if your workflow involves reading a screenshot, analyzing a diagram, or processing audio alongside code, nothing else at this size comes close.
Apache 2.0 licensed, available on Ollama, and comfortable on 6-8GB of VRAM.
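If you already run Ollama, calling the model from Python takes a few lines. A minimal sketch using the `ollama` Python client; the model tag `gemma4:e4b` is an assumption here, so check the Ollama library for the exact tag before pulling:

```python
# Minimal sketch: querying a local Gemma model through the Ollama Python client.
# Install with: pip install ollama
# NOTE: the model tag "gemma4:e4b" is an assumption -- verify the real tag
# in the Ollama library before pulling.
import ollama

response = ollama.chat(
    model="gemma4:e4b",
    messages=[
        {
            "role": "user",
            "content": "Explain what this line does: x = [i**2 for i in range(10)]",
        }
    ],
)
print(response["message"]["content"])
```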
Capabilities
- Text, image and audio understanding natively
- 128K context window
- Built-in thinking mode, configurable on or off
- Native function calling for agentic workflows
- Multilingual support across 35+ languages
- Runs on 6-8GB VRAM
2. gpt-oss-20B

OpenAI releasing open weights was unexpected. They’ve spent years building the case for why closed models are safer. Then they dropped two open weight models with full chain-of-thought access and an Apache 2.0 license.
The 20B is the relevant one here. It’s a MoE architecture with 3.6B active parameters, which means despite the 20B label it runs within 16GB of memory, manageable on a high-end consumer GPU or an M2 Pro and above.
On coding it holds up. Codeforces ELO of 2230 without tools and 2516 with tools puts it in serious company. For context that’s comfortably ahead of o3-mini’s 2073. AIME 2025 with tools hits 98.7%, actually edging out the 120B variant. These numbers are competitive with OpenAI’s own paid reasoning models.
The configurable reasoning effort is worth mentioning: low for quick answers, medium for balanced responses, high for anything that needs actual thinking. For coding tasks where you want the model to reason through a problem, that control matters.
One thing to know: it needs the harmony response format to work correctly, and standard prompting won’t behave as expected. Ollama handles this automatically, so if you’re pulling it that way you won’t notice, but it’s worth knowing if you’re integrating the model directly.
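Since Ollama applies the harmony template for you, selecting reasoning effort comes down to the system prompt. A hedged sketch, assuming the documented system-prompt convention of a `Reasoning: high` line and a `gpt-oss:20b` tag (both worth verifying against the model card):

```python
# Sketch: selecting gpt-oss reasoning effort through the system prompt.
# Assumption: per the model card, effort is set with a line like "Reasoning: high"
# in the system message; Ollama wraps everything in the harmony template for you.
import ollama

response = ollama.chat(
    model="gpt-oss:20b",  # assumed Ollama tag; verify with `ollama list`
    messages=[
        {"role": "system", "content": "Reasoning: high"},
        {"role": "user", "content": "Why does 0.1 + 0.2 != 0.3 in floating point?"},
    ],
)
print(response["message"]["content"])
```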
Capabilities
- Codeforces ELO 2516 with tools, 2230 without
- Configurable reasoning effort (low, medium, high)
- Full chain-of-thought access
- Native function calling and structured outputs
- Fine-tunable on consumer hardware
- Apache 2.0, available via Ollama
Related: Open Source LLMs That Rival ChatGPT and Claude
3. DeepSeek-R1-Distill-Llama-8B

DeepSeek-R1 is a 671B MoE reasoning model that made a lot of noise when it dropped earlier this year. Most people can’t run it. This is the version they can.
The Distill-Llama-8B is one of six smaller models DeepSeek released alongside R1, built by taking the reasoning patterns from the full 671B model and distilling them into a Llama 3.1-8B base. What comes out is an 8B model that reasons in a way most 8B models don’t: it self-verifies, reflects, and generates a proper chain of thought before answering.
On coding it scores 39.6 on LiveCodeBench and lands a Codeforces rating of 1205. Respectable for 8B, though if raw coding benchmark numbers are your priority, the gpt-oss-20B or the Qwen model further down this list will serve you better. Where this model earns its place is reasoning through problems: debugging logic errors, working through an algorithm step by step, catching edge cases. That’s where the distilled R1 behavior actually shows up.
It runs comfortably on 8GB of VRAM. MIT licensed. Available on Ollama.
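R1-style models emit their chain of thought inside `<think>...</think>` tags, so if you only want the final answer you have to strip that block yourself. A minimal sketch, assuming the distill is available under the `deepseek-r1:8b` Ollama tag:

```python
# Sketch: separating chain-of-thought from the final answer in R1-style output.
# R1-family models wrap their reasoning in <think>...</think> tags.
import re
import ollama

response = ollama.chat(
    model="deepseek-r1:8b",  # assumed Ollama tag for the Llama-8B distill
    messages=[{"role": "user", "content": "Why might `while i < 10: print(i)` never terminate?"}],
)
raw = response["message"]["content"]

# Pull out the reasoning block, keep the rest as the answer.
match = re.search(r"<think>(.*?)</think>", raw, flags=re.DOTALL)
reasoning = match.group(1).strip() if match else ""
answer = re.sub(r"<think>.*?</think>", "", raw, flags=re.DOTALL).strip()

print(f"reasoning length: {len(reasoning)} chars")
print(answer)
```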
Capabilities
- Self-verification and reflection built into reasoning
- Chain-of-thought inherited from 671B R1 model
- Codeforces rating 1205
- LiveCodeBench 39.6
- 128K context window
- MIT license, runs on 8GB VRAM
4. Qwen3.6-35B-A3B

Qwen has been putting out models fast enough that it’s easy to miss what actually changed between releases. Qwen3.6 deserves attention specifically for agentic coding.
The 35B-A3B is a MoE model with only 3B active parameters: the 35B is what stays on disk, the 3B is what your hardware actually runs at inference time. In practice, the model thinks with the capacity of a much larger architecture while staying relatively light on compute.
What Qwen specifically improved with this release is how the model handles frontend workflows and repository-level reasoning. SWE-bench Verified at 73.4 is a serious number: that benchmark tests whether a model can resolve actual GitHub issues in real codebases. Terminal-Bench 2.0 at 51.5 covers autonomous terminal task execution. These are agentic coding results.
The thinking preservation feature is genuinely useful for iterative development. By default models forget their reasoning between turns. Qwen3.6 can retain reasoning context from previous messages, which reduces redundant thinking and keeps the model consistent across a long back-and-forth coding session.
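In practice, the simplest way to benefit from this is to keep the model’s full replies in the running message history rather than truncating them between turns. A rough sketch of a multi-turn loop; `qwen3.6:35b-a3b` is an assumed tag:

```python
# Sketch: a multi-turn coding session that keeps the full conversation history,
# letting a model with thinking preservation reuse its earlier reasoning.
import ollama

MODEL = "qwen3.6:35b-a3b"  # assumed Ollama tag; verify before pulling
history = []

for prompt in [
    "Write a Python function that parses ISO-8601 dates.",
    "Now add handling for missing timezone offsets.",
]:
    history.append({"role": "user", "content": prompt})
    response = ollama.chat(model=MODEL, messages=history)
    reply = response["message"]["content"]
    # Appending the assistant turn verbatim keeps its reasoning available next turn.
    history.append({"role": "assistant", "content": reply})
    print(reply[:200], "...\n")
```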
The 3B active parameters sound light, but the full 35B weights still load into memory. With Q4 quantization via Ollama, or a GGUF loaded through Jan AI, you’re looking at 20GB+. An M2 Pro with 32GB or a 24GB GPU is the realistic target.
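The back-of-the-envelope math behind that 20GB+ figure is straightforward: Q4 quantization stores roughly half a byte per weight, plus some allowance for the KV cache and runtime. A quick sanity check:

```python
# Back-of-the-envelope VRAM estimate for a Q4-quantized MoE model.
# All 35B weights load into memory even though only ~3B are active per token.
total_params = 35e9          # total parameters on disk
bytes_per_weight = 0.5       # ~4 bits per weight at Q4
overhead_gb = 2.0            # rough allowance for KV cache, activations, runtime

weights_gb = total_params * bytes_per_weight / 1e9
print(f"weights: ~{weights_gb:.1f} GB, total: ~{weights_gb + overhead_gb:.1f} GB")
# -> weights: ~17.5 GB, total: ~19.5 GB, consistent with the 20GB+ figure
```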
Capabilities
- SWE-bench Verified 73.4, real GitHub issue resolution
- Terminal-Bench 2.0 at 51.5
- 3B active parameters despite 35B total
- Thinking preservation across conversation turns
- 262K native context window
- Agentic coding with MCP support via Qwen-Agent
- Apache 2.0 License
5. Phi-4 14B

Microsoft’s approach to small models has always been a bit different. While most labs race to the top with bigger parameter counts, the Phi series has consistently focused on one question: how good can a small model get if you’re obsessive enough about training data quality?
Phi-4 at 14B is the answer they landed on in late 2024, trained on 9.8 trillion tokens of carefully curated synthetic data, academic books, and filtered web content. The result is a model that consistently punches above its weight class on reasoning and math. GPQA at 56.1 actually beats GPT-4o’s 50.6, which is a strong result for a 14B model.
On coding, HumanEval sits at 82.6. Solid without being spectacular. Python is where it leads: the training data is heavily Python-weighted, so if your work lives in that ecosystem you’ll feel the difference. Other languages work, but Python is where it’s most reliable.
The practical advantage here is hardware. Q4 quantized or as a GGUF, it stays around 8-9GB, comfortable on an RTX 4060, a base M2, or most mid-range setups. MIT licensed.
But before you commit to this model, it’s important to know that the context window is 16K, the shortest on this list by a significant margin. And multilingual support is weak: this is an English-first model and doesn’t pretend otherwise.
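With only 16K tokens to work with, it’s worth budget-checking prompts before sending them. A crude sketch using the common ~4 characters per token heuristic (the exact ratio for Phi-4’s tokenizer is an assumption, and the file path is hypothetical):

```python
# Sketch: crude token-budget check before prompting a 16K-context model.
# Uses the rough ~4 characters/token heuristic; real tokenizers vary.
CONTEXT_LIMIT = 16_000
CHARS_PER_TOKEN = 4  # heuristic, not Phi-4's actual tokenizer ratio

def fits_context(prompt: str, reserve_for_output: int = 2_000) -> bool:
    """Return True if the prompt likely fits, leaving room for the reply."""
    est_tokens = len(prompt) / CHARS_PER_TOKEN
    return est_tokens <= CONTEXT_LIMIT - reserve_for_output

source = open("big_module.py").read()  # hypothetical file for illustration
if not fits_context(source):
    print("too long: send one function at a time instead")
```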
Capabilities
- GPQA 56.1, beating GPT-4o’s 50.6
- HumanEval 82.6
- Python-first coding with strong reasoning
- 8-9GB VRAM with Q4 quantization
- MIT license
- 16K context window
You May Like: Top AI Image Generators You Can Run Locally
Which one fits your setup
| Model | Maker | Active Params | VRAM needed | Context | License | Best for |
|---|---|---|---|---|---|---|
| Gemma 4 E4B-IT | Google DeepMind | 4.5B | 6-8GB | 128K | Apache 2.0 | Multimodal + accessibility |
| gpt-oss-20B | OpenAI | 3.6B | 16GB | 128K | Apache 2.0 | Reasoning + tool calling |
| DeepSeek-R1-Distill-Llama-8B | DeepSeek | 8B | 8GB | 128K | MIT | Reasoning + debugging |
| Qwen3.6-35B-A3B | Qwen | 3B | 20GB+ | 262K | Apache 2.0 | Agentic coding |
| Phi-4 14B | Microsoft | 14B | 8-9GB | 16K | MIT | Reasoning + Python |
The open source model space is moving fast. A year ago, a locally running model that could handle real GitHub issues or compete with o3-mini on coding benchmarks would have sounded optimistic. These five exist today, with open weights.
The gap between frontier and local isn’t closed yet. But it’s closing faster. The day a truly frontier-level coding model runs on a mid-range consumer GPU isn’t a prediction anymore; it’s starting to look like a timeline.
We’ll keep updating this list as the space moves.




