back to top
HomeTechGoogle Built Gemma 4 12B Without Multimodal Encoders

Google Built Gemma 4 12B Without Multimodal Encoders

- Advertisement -

Every multimodal model you’ve used has the same basic system. Text goes in one way, images go through a vision encoder first, audio goes through an audio encoder first, and then everything gets handed off to the language model in a form it can work with. The encoders are load-bearing and you don’t just remove them.

Google actually removed them.

Gemma 4 12B takes raw image patches and raw audio waveforms and projects them directly into the same embedding space as text tokens. No vision encoder. No audio encoder. One decoder handling everything.

What encoders actually do and why removing them is a bet

The encoder’s job is translation. A vision encoder takes pixels and converts them into a representation the language model can reason over. An audio encoder does the same for waveforms. They’re trained specifically for this, dedicated components that have learned to compress visual and acoustic information into something a language model can use.

Removing them means the language model has to do that translation itself. Raw image patches go in as lightweight embeddings through a single matrix multiplication. Raw audio gets projected directly into the same dimensional space as text. The LLM backbone takes over from there.

That’s an important architectural bet for two reasons. First, dedicated encoders have years of specialized training behind them. Replacing that with a linear projection layer is aggressive simplification. Second, if it doesn’t work well, you get a model that’s bad at images and audio specifically, the things that make it multimodal in the first place.

Google’s argument, implicit in the design, is that a capable enough language model backbone doesn’t need the translation layer. It can learn to handle raw inputs directly.

The size slot Google chose for this experiment

The Gemma 4 family has five models. The E2B and E4B are edge models built for phones. The 26B MoE and 31B dense are serious compute options. The 12B sits in the middle and it’s the only one Google built encoder-free.

The edge models need every optimization they can get so they keep their encoders. The large models are where Google plays it safe on architecture. The 12B is where they tried something different, capable enough to be useful, small enough that an architectural experiment doesn’t cost as much if it goes sideways.

The practical result is a model that runs on 16GB of VRAM. A MacBook Pro with 16GB unified memory can run this. Most consumer laptops with a recent discrete GPU can run this. That’s the target and the encoder-free design is part of how Google hit it, fewer components means smaller footprint, less latency, and no separate encoder weights to load.

Related: Gemma 4 Makes Local AI Agents Actually Practical

The benchmarks that make the bet look smart

All numbers below are from Google’s own evaluation.

On AIME 2026 without tools, 12B scores 77.5 against the 26B MoE’s 88.3. That gap is there but 77.5 from a model running on a laptop is not a number you dismiss. LiveCodeBench puts it at 72.0, again behind the 26B at 77.1 but meaningfully ahead of where a 12B model had any right to be a year ago. GPQA Diamond at 78.8, that’s a hard science reasoning benchmark and 12B is within striking distance of the 26B’s 82.3.

MMMU Pro vision at 69.1, MATH-Vision at 79.7. For a model with no dedicated vision encoder those are genuinely good numbers.

The one place the encoder-free bet shows its cost is OmniDocBench, which tests document understanding. The 12B scores 0.164 average edit distance against the 26B’s 0.149. Lower is better here and the gap is small but it’s the benchmark where fine-grained visual detail matters most, exactly the scenario where a dedicated vision encoder would have the clearest advantage.

Limitations

Audio is capped at 30 seconds and it’s a hard limit. Anything longer gets cut off. For transcription of short voice notes or quick translations it’s fine. For anything resembling a real conversation or a meeting recording, it’s not the right tool.

Video support is also absent on the 12B entirely. The 26B and 31B handle video. The 12B doesn’t. If video understanding matters for your use case, you’re looking at a different model in the family.

Context window is 256K, which sounds large until you compare it to what’s shipping elsewhere right now. Minimax M3 just launched with 1M. For most local use cases 256K is plenty but worth knowing where the ceiling is.

Long context retrieval is also where the 12B shows its size most clearly. MRCR v2 8-needle at 128K scores 43.4 against the 31B’s 66.4. Hiding information deep in a long document and asking the model to find it is harder at this size.

How to run it

The quickest path is Ollama or LM Studio, both support the 12B and neither requires much setup. The weights are on Hugging Face under Apache 2.0, about as clean a license as you get for a Google model.

For thinking mode, set enable_thinking=True in the chat template. It’s off by default. Worth turning on for anything involving multi-step reasoning or complex document analysis. For simple queries leave it off, the latency difference is noticeable.

Another thing to note before you begin: image content should go before text in your prompt, audio after. Google flags this in the model card and it affects output quality enough that it’s not just a suggestion.

The bet paid off at this size

Google took the middle slot in the Gemma 4 family and used it to remove components that every other multimodal model treats as non-negotiable. No vision encoder, no audio encoder, one decoder doing everything. The benchmarks didn’t collapse. On reasoning and coding they came in closer to the 26B than the parameter gap would suggest.

Whether this architecture scales up is the genuinely open question. The 31B kept its encoders. Google wasn’t ready to make this bet at full size. But at 12B, running on a laptop, the encoder-free design holds up well enough.

The weights are on Hugging Face now. Worth finding out for yourself.

LEAVE A REPLY

Please enter your comment!
Please enter your name here

YOU MAY ALSO LIKE
MiniMax M3 Shows What Happens When AI Stops Thinking in Turns

MiniMax M3 Shows What Happens When AI Stops Thinking in Turns

0
Most models quit around submission 30 because they stop finding improvement and exit on their own. That's what happened when MiniMax ran a CUDA kernel optimization task against a field of frontier models. Every model except two called it done within the first 30 submissions. M3's best result came on submission 145. After 24 hours. After multiple plateaus where the numbers stopped moving and a reasonable model would have concluded there was nothing left to find. That's the thing MiniMax released yesterday. An AI model with a 1M token context window, native multimodality, and apparently a problem with knowing when to stop.
Anthropic Files for an IPO. AI Is Entering Its Public Company Era

Anthropic Files for an IPO. AI Is Entering Its Public Company Era.

0
Anthropic has officially taken its first step toward becoming a public company. In a brief announcement on Monday, the company said it had confidentially submitted a draft S-1 registration statement to the U.S. Securities and Exchange Commission for a proposed initial public offering. The filing doesn't reveal a share price, a fundraising target, or even a timeline. For now, it simply gives Anthropic the option to go public once the SEC review process is complete. Just a few years ago, Anthropic was a small group of former OpenAI researchers trying to build an alternative vision for advanced AI. Today, it sits among the handful of companies shaping the industry's future and that's why this filing matters. It's one of the world's most influential AI labs beginning the transition from a privately funded research company to a business that may eventually answer to public shareholders. For most of the AI boom, the biggest bets were made behind closed doors. Venture firms, sovereign wealth funds, and tech giants supplied the capital while the public watched from the outside. Anthropic's filing suggests that era may be starting to change.

OpenAI Says Its AI Solved an 80-Year-Old Math Problem. The Proof Surprised Mathematicians.

0
OpenAI says one of its internal reasoning models has solved a math problem that has been there on mathematicians' desks since 1946. The problem, first posed by legendary mathematician Paul Erdős, looks almost absurdly simple. Given a set of points on a flat plane, how many pairs can be exactly one unit apart? People have spent nearly 80 years trying to pin down the answer. OpenAI's model didn't just make progress on the problem. According to the company, it disproved a longstanding conjecture that many researchers believed was essentially correct.

Don’t miss any Tech Story

Subscribe To Firethering NewsLetter

You Can Unsubscribe Anytime! Read more in our privacy policy