Why does this 4-bit version have a 32.7 GB size?
I was expecting it to be around ~20 GB so it would fit on 1x RTX 5090.
The size is closer to 8-bit, btw.
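A quick back-of-envelope check of that suspicion (assuming a ~31B-parameter dense model; the exact parameter count and the fact that quantization scales/embeddings are ignored are my simplifications here):

```python
# Rough weight-only size estimate: params * bits-per-weight / 8 bytes.
# Ignores group scales, embeddings, and mixed-precision layers.
def weight_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GB."""
    return n_params * bits_per_weight / 8 / 1e9

print(round(weight_size_gb(31e9, 4), 1))  # ~15.5 GB -- what a true 4-bit quant would weigh
print(round(weight_size_gb(31e9, 8), 1))  # ~31.0 GB -- right in the neighborhood of 32.7 GB
```

So a 32.7 GB file really is 8-bit territory for a model this size, not 4-bit.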
Tensor type
BF16 F8_E4M3 U8
lol...
No matter... we want it smaller to fit in consumer-grade video cards )
This 32.7GB build isn't a standard 4-bit quant; it’s a Compound AI Architecture optimized for the M5/RTX 6000 era. By leveraging Mixed-Precision Tensors, we maintain logic-critical layers in BF16, while offloading heavy computation to F8_E4M3 for a 4.5x throughput boost. The inclusion of U8-indexed KV Caching and speculative decoding allows for a staggering 400 t/s without the typical perplexity degradation of legacy INT4. When wrapped in a RAG + Validation Gates pipeline, this local engine effectively bridges the gap to frontier cloud models. High-density engineering for devs who prioritize local privacy without sacrificing 'GPT-5.2' class reasoning.
The model size was intentionally set so that it wouldn't work well with 32GB GPUs. Perhaps they want to sell more RTX Pro 6000s.
try prithivMLmods/gemma-4-31B-it-NVFP4
I mean, if it's faster in general, then it's also still faster when VRAM swapping.
The real question is: how good is it still? That's the big claim up there. So if it's faster and still good, that's a net benefit.
All academic, because it is the fully censored (i.e. useless) base version anyway.
I don't get it. Given it's a dense model, if it needs significant swapping/offloading to system RAM on, say, an RTX 5090 32GB, then (especially considering the additional VRAM needed for KV cache) it would be way slower than an RTX Pro 6000 or a theoretical 5090 with 48GB VRAM, right?
Also, why is it useless? Genuine question. Does it hurt something like coding assistance?
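To put a rough number on that KV-cache overhead, here's a minimal sketch; the layer/head/context figures below are placeholder assumptions for illustration, not this model's actual config:

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                seq_len: int, bytes_per_elem: int = 2, batch: int = 1) -> float:
    """Approximate KV-cache size in GB; the leading 2 covers the K and V tensors."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem * batch / 1e9

# Placeholder config: 48 layers, 8 KV heads (GQA), head_dim 128, 32k context, FP16 cache.
print(round(kv_cache_gb(48, 8, 128, 32768), 1))  # ~6.4 GB on top of the weights
```

Several extra GB for a long context is exactly why a 32.7 GB checkpoint is painful on a 32 GB card even before any swapping starts.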
If I had a RTX Pro 6000 Blackwell I would just run the original version...
Way slower, yes. Usable? Yes. Depends how much you have to do with it. Since unified memory is all the rage, it can't be that bad.
But the 31B seems to be very slow in general compared with similar dense models. Maybe teething issues...
Finally, I discovered the unofficial NVFP4 quant (23GB). Tested it out. It works pretty well on 1x RTX 5090 with vLLM.
But still, I can't understand why NVIDIA won't release their official NVFP4 (4-bit!!!!!!!!!) quant of this astonishing model, so we could be sure that everything is quantized properly. Come on, NVIDIA, the community is waiting. For you it's 2 minutes of work; for us it's a lifebuoy!
OK, maybe for a single user you can say it's borderline usable, but if performance drops so much, then it makes little sense to use with something like a 5090 (since it would spend most of the time swapping to system RAM). In that case I think most would really be better off with smaller quants that fit in VRAM (and also leave enough space for KV cache), e.g. cyankiwi/gemma-4-31B-it-AWQ-4bit or other decent 4 or 5 bpw quants. And if you have more than 32GB VRAM, then wouldn't e.g. cyankiwi/gemma-4-31B-it-AWQ-8bit be better quality?
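The "pick a quant that fits" logic above can be sketched as a rule of thumb; the ~2 GB runtime overhead and the ~17 GB figure for a 4-bit AWQ of a 31B model are assumed ballpark values, not measured ones:

```python
def fits_in_vram(vram_gb: float, weights_gb: float, kv_cache_gb: float,
                 overhead_gb: float = 2.0) -> bool:
    """True if weights + KV cache + assumed runtime overhead fit on the card."""
    return weights_gb + kv_cache_gb + overhead_gb <= vram_gb

# On a 32 GB RTX 5090, with ~6 GB budgeted for KV cache:
print(fits_in_vram(32, 17, 6))    # True  -- ~17 GB 4-bit AWQ leaves headroom
print(fits_in_vram(32, 32.7, 6))  # False -- the 32.7 GB build can't even load fully
```

Which is the whole argument in one line: on 32 GB cards, a proper 4 or 5 bpw quant wins by default.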