Gemma 4 12B: All Hype, No Substance

Every few weeks a new model lands with a wave of charts showing it beating everything in its size class. Gemma 4 12B arrived exactly like that — big numbers, big claims, a lot of excitement. So I dropped it into the one place that actually matters to me: the brain behind my sovereign brain. Our first round of testing was a complete failure.

The benchmarks said one thing

On paper, Gemma 4 12B looks like a steal. Mid-size, strong scores, the kind of model you’d assume slots right in. There’s even the claim that it’ll happily run on a PC with just 24GB of RAM — which I find very doubtful, or at best true at near-zero tokens per second, which is the same as useless. That’s the trap. Benchmarks (and spec-sheet claims) measure narrow, well-defined conditions that look nothing like real work. A second brain is the opposite: long sessions, real files, tool calls, and a hundred small details that have to go right in a row. None of that shows up on a leaderboard.

The real tasks said another

I put it through the same job I give every candidate — be the engine driving my sovereign brain — and it fell apart fast.

2x slower than Qwen3.6 35B A3B. For a 12B dense model that’s not the bargain the size implies. Qwen3.6’s MoE design activates only ~3B per token, so the smaller-on-paper Gemma is the one that drags. You feel every second of it in a working session.
Half the usable context. A second brain lives or dies on how much it can hold in view — your files, the task, the conversation so far. Gemma 4 12B gave me roughly half the room Qwen3.6 does, so it loses the thread sooner.
It can’t keep filenames straight. This was the killer. It kept confusing underscores and dashes — reaching for my_file.md when the file on disk was my-file.md. One character, and the tool call fails.
And then it loops. A failed call should trigger a rethink. Instead Gemma retried the same wrong filename, failed again, retried again — stuck in a loop until I stopped it. A small mistake compounding into a dead end is exactly what you can’t have in the thing you trust with your thinking.

A model that’s slower, holds less, and gets stuck retrying its own typos isn’t a thinking partner. It’s a liability.

Don’t follow the benchmarks and the hype

This is the lesson, and it’s worth repeating every time a new model drops: don’t just follow the benchmarks and the hype. The charts told me Gemma 4 12B was a contender. Ten minutes of real work told me the truth.

The fix is simple and it costs nothing: try some real tasks. Hand the model a genuine job from your own workflow — edit the right file, call the right tool, hold a long session together — and you’ll see the difference almost immediately. The gap between a model that benchmarks well and a model that works is enormous, and it only shows up when you do the actual work.

For now, Gemma 4 12B is a no from me. If you want a small local model that actually clears the bar, Qwen3.6 35B A3B is still the one to use.

Build your own

The tools are open source and free.

Take what you need and create your sovereign brain at sovereignbrain.me.