Why does this 4-bit version have a 32.7 GB size?
I was expecting it to be around ~20 GB so it would fit on 1x RTX 5090.
The size is closer to 8-bit, btw.
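A quick back-of-envelope check of that suspicion (assuming a ~31B-parameter dense model; the exact parameter count and the fact that quantization scales/embeddings are ignored are my simplifications here):

```python
# Rough weight-only size estimate: params * bits-per-weight / 8 bytes.
# Ignores group scales, embeddings, and mixed-precision layers.
def weight_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GB."""
    return n_params * bits_per_weight / 8 / 1e9

print(round(weight_size_gb(31e9, 4), 1))  # ~15.5 GB -- what a true 4-bit quant would weigh
print(round(weight_size_gb(31e9, 8), 1))  # ~31.0 GB -- right in the neighborhood of 32.7 GB
```

So a 32.7 GB file really is 8-bit territory for a model this size, not 4-bit.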
Tensor type
BF16 F8_E4M3 U8
lol...
No matter... we want it smaller to fit in consumer-grade video cards )
This 32.7GB build isn't a standard 4-bit quant; it’s a Compound AI Architecture optimized for the M5/RTX 6000 era. By leveraging Mixed-Precision Tensors, we maintain logic-critical layers in BF16, while offloading heavy computation to F8_E4M3 for a 4.5x throughput boost. The inclusion of U8-indexed KV Caching and speculative decoding allows for a staggering 400 t/s without the typical perplexity degradation of legacy INT4. When wrapped in a RAG + Validation Gates pipeline, this local engine effectively bridges the gap to frontier cloud models. High-density engineering for devs who prioritize local privacy without sacrificing 'GPT-5.2' class reasoning.
The model size was intentionally set so that it wouldn't work well with 32GB GPUs. Perhaps they want to sell more RTX Pro 6000s.
try prithivMLmods/gemma-4-31B-it-NVFP4
I mean, if it's faster in general, then it's also still faster when VRAM swapping.
The real question is: how good is it still? That's the big claim up there. So if it's faster and still good, that's a net benefit.
All academic, because it is the fully censored (i.e. useless) base version anyway.
I don't get it. Given it's a dense model, if it needs significant swapping/offloading to system RAM on, say, an RTX 5090 32GB, then (especially considering the additional VRAM needed for KV cache) it would be way slower than an RTX Pro 6000 or a theoretical 5090 with 48GB VRAM, right?
Also, why is it useless? Genuine question. Does it hurt something like coding assistance?
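To put a rough number on that KV-cache overhead, here's a minimal sketch; the layer/head/context figures below are placeholder assumptions for illustration, not this model's actual config:

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                seq_len: int, bytes_per_elem: int = 2, batch: int = 1) -> float:
    """Approximate KV-cache size in GB; the leading 2 covers the K and V tensors."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem * batch / 1e9

# Placeholder config: 48 layers, 8 KV heads (GQA), head_dim 128, 32k context, FP16 cache.
print(round(kv_cache_gb(48, 8, 128, 32768), 1))  # ~6.4 GB on top of the weights
```

Several extra GB for a long context is exactly why a 32.7 GB checkpoint is painful on a 32 GB card even before any swapping starts.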
If I had a RTX Pro 6000 Blackwell I would just run the original version...
Way slower, yes. Usable? Yes. Depends how much you have to do with it. Since unified memory is all the rage, it can't be that bad.
But the 31B seems to be very slow in general compared with similar dense models. Maybe teething issues...
Finally, I discovered the unofficial NVFP4 quant (23GB). Tested it out. It works pretty well on 1x RTX 5090 with vLLM.
But still, I can't understand why NVIDIA won't release their official NVFP4 (4-bit!!!!!!!!!) quant of this astonishing model, so we could be sure that everything is quantized properly. Come on, NVIDIA, the community is waiting. For you it's 2 minutes of work; for us it's a lifebuoy!
OK, maybe for a single user you can say it's borderline usable, but if performance drops so much, then it makes little sense to use with something like a 5090 (since it would spend most of the time swapping to system RAM). In that case I think most would really be better off with smaller quants that fit in VRAM (and also leave enough space for KV cache), e.g. cyankiwi/gemma-4-31B-it-AWQ-4bit or other decent 4 or 5 bpw quants. And if you have more than 32GB VRAM, then wouldn't e.g. cyankiwi/gemma-4-31B-it-AWQ-8bit be better quality?
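The "pick a quant that fits" logic above can be sketched as a rule of thumb; the ~2 GB runtime overhead and the ~17 GB figure for a 4-bit AWQ of a 31B model are assumed ballpark values, not measured ones:

```python
def fits_in_vram(vram_gb: float, weights_gb: float, kv_cache_gb: float,
                 overhead_gb: float = 2.0) -> bool:
    """True if weights + KV cache + assumed runtime overhead fit on the card."""
    return weights_gb + kv_cache_gb + overhead_gb <= vram_gb

# On a 32 GB RTX 5090, with ~6 GB budgeted for KV cache:
print(fits_in_vram(32, 17, 6))    # True  -- ~17 GB 4-bit AWQ leaves headroom
print(fits_in_vram(32, 32.7, 6))  # False -- the 32.7 GB build can't even load fully
```

Which is the whole argument in one line: on 32 GB cards, a proper 4 or 5 bpw quant wins by default.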