Do Gemma 4 models work well?

#65
by Regrin - opened

Good afternoon!
I've come across mixed reviews of Gemma 4. Some say it's a breakthrough model; others say it's glitchy, confuses tokens, and writes slightly different words instead of the intended ones.

Maybe it's because the people at LM Studio have bad quants?

They say Qwen 3.5 works better.

My model (I think the Gemma 4 E4B or E2B) has started thinking twice in Ollama.

Ollama is very technically dated at this point, so I can't tell you anything about its level of support for Gemma 4. The model itself isn't glitchy, and it doesn't "confuse tokens", whatever that's supposed to mean. It can't "think twice" either; that's your server doing something funky.

If you're comparing Gemma 4 31B (this is where you decided to post this) and Qwen 3.5 27B (I can't say anything about the MoE models; I avoid those on principle), then, for Gemma 4:

  • Its KV cache (the part of your memory used to store the context) is extremely compact, meaning that despite being a larger model, it uses a comparable amount of VRAM to the 27B model at equal max context.
  • Its thinking block is a lot shorter, and I've yet to see it go into infinite loops.
  • It's more permissive / less censored out of the box
  • It's much less prone to hallucinations, but like any model of this size, it's worthless if you intend to use it as a knowledge base.
  • It's sampler-friendly: you can tweak the samplers to your heart's content
  • It speaks much better English
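The KV-cache point is easy to sanity-check with back-of-the-envelope arithmetic: cache size scales with layers × KV heads × head dim × context length, so a bigger model with fewer KV heads (aggressive GQA) can still need less cache than a smaller one. A minimal sketch with made-up configs (NOT the real Gemma 4 or Qwen 3.5 numbers, which I haven't looked up):

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx_len, bytes_per_elem=2):
    """Unquantized KV-cache size: one K and one V tensor per layer,
    each [n_kv_heads, ctx_len, head_dim], at fp16 (2 bytes) by default."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem

# Hypothetical configs for illustration only:
# a "larger" model with aggressive GQA (few KV heads)...
big_gqa = kv_cache_bytes(n_layers=48, n_kv_heads=4, head_dim=128, ctx_len=32768)
# ...vs a "smaller" model with many KV heads.
small_wide = kv_cache_bytes(n_layers=40, n_kv_heads=16, head_dim=128, ctx_len=32768)

print(f"{big_gqa / 2**30:.1f} GiB vs {small_wide / 2**30:.1f} GiB")  # 3.0 GiB vs 10.0 GiB
```

With these invented numbers the larger model's cache is a third the size of the smaller one's at the same 32k context, which is the kind of gap that makes the VRAM comparison come out even.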

However,

  • It's very particular about its instruction template, making it a bit of a pain for JSON / structured output, even if it can be worked around.
  • I can't speak much about tool-calling and agentic use yet. Qwen 3.5 27B is really punching above its weight in that domain, and while I haven't had the time to test Gemma 4's abilities, there's no way it's anywhere close.
  • I haven't compared image recognition abilities yet.
  • The "assistant"-speak is really drilled down into it, and it has a tendency to write longer and longer responses when not prompted adversely against those traits.
  • It looks like the bulk of the copyrighted training material got removed, meaning it'll struggle at "impersonating" someone famous
  • It loves its em-dashes and slop writing a bit too much
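On the structured-output pain: if you can't constrain decoding server-side (grammars, JSON mode), the usual workaround is to tolerate the chatty wrapper and pull the JSON out of the reply yourself. A minimal sketch; the fenced-reply format here is a generic assumption, not anything Gemma-specific:

```python
import json
import re

def extract_json(text):
    """Pull the first JSON object out of a chatty model reply.

    Handles replies wrapped in ```json fences as well as JSON
    embedded in surrounding prose."""
    # Prefer a fenced block if the model emitted one; the closing
    # fence anchors the match, so nested braces are fine here.
    m = re.search(r"```(?:json)?\s*(\{.*?\})\s*```", text, re.DOTALL)
    if m:
        return json.loads(m.group(1))
    # Fallback: take the span from the first "{" to the last "}",
    # which also survives nested objects.
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in reply")
    return json.loads(text[start:end + 1])

reply = 'Sure! Here you go:\n```json\n{"name": "Ada", "ok": true}\n```\nAnything else?'
print(extract_json(reply))  # {'name': 'Ada', 'ok': True}
```

It's a band-aid, not a fix: malformed JSON still fails at `json.loads`, so you'd typically wrap this in a retry loop against the model.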

It really depends on your use case. For agentic, I'd use Qwen. For text editing/summary/translation/..., especially in English, I'd definitely use Gemma 4.

What do you mean when you say Ollama is outdated?

Is it developing slowly?

That too.

But its inference speed is very low compared to practically everything else, which is reason enough on its own (LM Studio is also kinda slow). It's lagging behind on support: new architectures are added more slowly, and their support is flaky at the best of times. The feature set for people who actually use these models beyond casual chatting is very limited. And I could go on at length about why its API reads like it was written by someone who never used their own product, but I don't think you'd care. At this point, the only handy part is that you can download a model from the command line.

Nowadays things are more or less like this (give or take) in the quantized world:

ik_llama > llama.cpp (or koboldCpp for casual-friendly) > lmstudio > ollama
