This version is still too large for a single 5090.
The NVFP4 version is 32 GB, which is still too large for a single 5090. Could more of the network be converted to NVFP4 to make single-5090 inference possible?
Same question )
Same, it's nearly as large as fp8. What's the point of this NVFP4 quantization?
Indeed, this version is about the same size as the fp8 quantized one. What is NVIDIA doing? Is it only meant for the RTX Pro 6000?
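A rough back-of-envelope check supports this complaint. Assuming ~31B parameters (from the model name) and NVFP4's usual layout of 4-bit values in blocks of 16 with one 1-byte FP8 scale per block (both assumptions, not confirmed by this thread), a fully NVFP4 checkpoint should be far smaller than 32 GB:

```python
# Hypothetical size estimate: a fully-NVFP4 31B model vs. FP8.
# Assumptions: 31e9 params; NVFP4 = 4-bit values + one FP8 (8-bit)
# scale per block of 16 values; per-tensor scales ignored (negligible).
params = 31e9

nvfp4_bits_per_param = 4 + 8 / 16          # value bits + amortized block-scale bits
nvfp4_gb = params * nvfp4_bits_per_param / 8 / 1e9
fp8_gb = params * 8 / 8 / 1e9              # one byte per parameter

print(f"NVFP4 weights: ~{nvfp4_gb:.1f} GB")   # ~17.4 GB
print(f"FP8 weights:   ~{fp8_gb:.1f} GB")     # ~31.0 GB
```

If those assumptions hold, a ~32 GB checkpoint suggests a large fraction of the network was left at 8-bit or higher precision rather than converted to NVFP4.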
It fails on DGX Spark too.
It didn't work on the Spark?
The model size was intentionally set so that it wouldn't work well with 32GB GPUs. Perhaps they want to sell more RTX Pro 6000s.
This quant has problems; the quantization achieved nothing.
I ran it on DGX Spark, and it works. My system boots with init 3, so it is in headless mode. Use Docker:
```shell
docker run --runtime nvidia --gpus all -it --rm -d --env "HF_HUB_OFFLINE=1" \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8000:8000 \
  --ipc=host \
  vllm/vllm-openai:gemma4-cu130 --model nvidia/Gemma-4-31B-IT-NVFP4 --quantization modelopt --enable-auto-tool-choice --tool-call-parser gemma4 --reasoning-parser gemma4
```
The model takes all the memory, though.
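Once the container is up, the server exposes vLLM's OpenAI-compatible API on the mapped port. A minimal sketch of a chat request body (the model name matches the `--model` flag above; the prompt text is just an illustration):

```python
import json

# Request body for vLLM's OpenAI-compatible /v1/chat/completions route,
# served by the docker command above on localhost:8000.
payload = {
    "model": "nvidia/Gemma-4-31B-IT-NVFP4",
    "messages": [
        {"role": "user", "content": "Describe this chart in one sentence."}
    ],
    "max_tokens": 128,
}

body = json.dumps(payload)
print(body)
# Send it with, e.g.:
#   curl http://localhost:8000/v1/chat/completions \
#     -H "Content-Type: application/json" -d "$body"
```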
I don't have quantitative results, but outputs on images of complex procedural graphs or scientific figures work, and details (e.g. numbers) read from the images are accurate.
Currently, the included tool-call parser (Docker image 9afe08ebfa30) is not right and may format tool arguments incorrectly.