Can you guys make a smaller dense model?

#29 · opened by Nesy1

Honestly, Gemma 31B has been very nice. However, I was wondering if you guys could make a 27B or 24B dense version for hardware plebs like us. (I'm not sure about the MoE, but that's all.)

Let me know what you people think!

why the fuck would you need a 31B AND a 27B? just use a lower quant, there are legit a billion better things you could've complained about

> why the fuck would you need a 31B AND a 27B? just use a lower quant, there are legit a billion better things you could've complained about

Lower quants cause severe degradation below Q4, for example, even with imatrix. People do not recommend Q3 unless you are desperate or have low VRAM. Arguably, a lot of people say that taking a lower-parameter model is better than taking a lower quant of a high-parameter model. "Just use a lower quant" is not something that cuts it. (This obviously applies less to higher-parameter models beyond 70B or so.)

A 27B at Q4_K_S or IQ4_XS lets people with 16GB of VRAM fit the model with roughly 12-20K context on Qwen3.5 (see the rough estimate sketched below). Giving people headroom IS a good and valid thing to ask for, and having multiple options is a good idea. This is like asking "Well, why do we want a 70B version and a 24B version of our model?" The answer is that people want users to have the smoothest experience running models without having to quant down severely or run low on context.
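For a rough sense of the numbers, here is a back-of-the-envelope sketch in Python. The layer count, KV head count, head dimension, and bits-per-weight are all illustrative assumptions (not specs of any actual model), but they show why ~4.25 bpw plus an 8-bit KV cache is about what a 16GB card can hold:

```python
# Back-of-the-envelope VRAM estimate for a quantized dense model.
# All figures below are illustrative assumptions, not official specs.

def weights_gib(params_b: float, bits_per_weight: float) -> float:
    """Approximate weight size in GiB at a given average bits/weight."""
    return params_b * 1e9 * bits_per_weight / 8 / 2**30

def kv_cache_gib(ctx: int, layers: int, kv_heads: int, head_dim: int,
                 bytes_per_elem: float = 1.0) -> float:
    """Approximate K+V cache size in GiB (assuming an 8-bit KV cache)."""
    return ctx * layers * kv_heads * head_dim * 2 * bytes_per_elem / 2**30

# Hypothetical 27B dense model at IQ4_XS (~4.25 bits/weight),
# with assumed 60 layers, 8 KV heads, head_dim 128:
w = weights_gib(27, 4.25)
kv = kv_cache_gib(16_384, 60, 8, 128)
print(f"weights ~{w:.1f} GiB + 16K ctx KV ~{kv:.1f} GiB = ~{w + kv:.1f} GiB")
# -> roughly 13.4 + 1.9 = ~15.2 GiB, which is exactly 16GB-card territory
```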

And I would heavily suggest not reframing my question as a complaint rather than a simple suggestion. Thank you.

MoE actually solves the problem you're describing. I'm running a 26B MoE at a high-quality Q4_K_XL quant and getting 17.68 tok/sec on an RTX 2060 (12GB). It allows me to keep the model's intelligence high without needing massive VRAM or settling for the severe degradation that comes with the low-bit quants you're suggesting.
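For intuition on why that works, here's a crude sketch: single-stream decode is roughly memory-bandwidth-bound, and a MoE only reads its active parameters per token. The ~4B-active figure and the ~336 GB/s bandwidth are assumptions for illustration:

```python
# Crude decode-speed ceiling: single-stream generation is roughly
# memory-bandwidth-bound, so tok/s <= bandwidth / bytes-read-per-token.
# All figures below are illustrative assumptions, not measurements.

def tok_per_sec_ceiling(active_params_b: float, bits_per_weight: float,
                        mem_bw_gb_s: float) -> float:
    """Upper bound on decode tok/s from memory bandwidth alone."""
    bytes_per_token = active_params_b * 1e9 * bits_per_weight / 8
    return mem_bw_gb_s * 1e9 / bytes_per_token

BW = 336.0  # GB/s, roughly RTX 2060 (12GB) class -- assumed

# Hypothetical 26B-total MoE with ~4B active params vs. a 26B dense,
# both at ~4.5 bits/weight:
print(f"MoE   (~4B active): ~{tok_per_sec_ceiling(4, 4.5, BW):.0f} tok/s ceiling")
print(f"dense (26B active): ~{tok_per_sec_ceiling(26, 4.5, BW):.0f} tok/s ceiling")
```

Real speeds land well under that ceiling once layers spill into system RAM, which is why I measure 17.68 tok/sec rather than the theoretical number, but the dense ceiling shows why a full 26B of active weights is so much slower on the same card.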

just download more vram xd πŸ€—

> just download more vram xd 🤗

Okay, Mr. Jensen.


> MoE actually solves the problem you're describing. I'm running a 26B MoE at a high-quality Q4_K_XL quant and getting 17.68 tok/sec on an RTX 2060 (12GB). It allows me to keep the model's intelligence high without needing massive VRAM or settling for the severe degradation that comes with the low-bit quants you're suggesting.

I mean, sure, but I find Gemma 31B naturally better than the MoE, and I have a hard time looking back. I almost feel like a Gemma 27B would be better than the 26B A4B MoE, but maybe MoEs have improved so much that a 27B dense is not viable.
