This version is still too large for a single 5090.
The NVFP4 version is 32 GB, which is still too large for a single 5090. Could more of the network be converted to NVFP4 to make single-5090 inference possible?
Same question )
Same, it's nearly as large as fp8. What's the point of this NVFP4 quantization?
Indeed, this version is about the same size as the fp8 quantized one. What is NVIDIA doing? Is it only meant for the RTX Pro 6000?
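A rough back-of-envelope check supports this complaint. Assuming ~31B parameters (from the model name) and NVFP4's usual layout of 4-bit values in blocks of 16 with one 1-byte FP8 scale per block (both assumptions, not confirmed by this thread), a fully NVFP4 checkpoint should be far smaller than 32 GB:

```python
# Hypothetical size estimate: a fully-NVFP4 31B model vs. FP8.
# Assumptions: 31e9 params; NVFP4 = 4-bit values + one FP8 (8-bit)
# scale per block of 16 values; per-tensor scales ignored (negligible).
params = 31e9

nvfp4_bits_per_param = 4 + 8 / 16          # value bits + amortized block-scale bits
nvfp4_gb = params * nvfp4_bits_per_param / 8 / 1e9
fp8_gb = params * 8 / 8 / 1e9              # one byte per parameter

print(f"NVFP4 weights: ~{nvfp4_gb:.1f} GB")   # ~17.4 GB
print(f"FP8 weights:   ~{fp8_gb:.1f} GB")     # ~31.0 GB
```

If those assumptions hold, a ~32 GB checkpoint suggests a large fraction of the network was left at 8-bit or higher precision rather than converted to NVFP4.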
It fails on DGX Spark too.
It didn't work on the Spark?
The model size was intentionally set so that it wouldn't work well with 32GB GPUs. Perhaps they want to sell more RTX Pro 6000s.
This quant has problems; the quantization achieved nothing.
I ran it on DGX Spark, and it works. My system boots with init 3, so it is in headless mode. Use Docker:
```shell
docker run --runtime nvidia --gpus all -it --rm -d --env "HF_HUB_OFFLINE=1" \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8000:8000 \
  --ipc=host \
  vllm/vllm-openai:gemma4-cu130 --model nvidia/Gemma-4-31B-IT-NVFP4 --quantization modelopt --enable-auto-tool-choice --tool-call-parser gemma4 --reasoning-parser gemma4
```
The model takes all the memory, though.
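Once the container is up, the server exposes vLLM's OpenAI-compatible API on the mapped port. A minimal sketch of a chat request body (the model name matches the `--model` flag above; the prompt text is just an illustration):

```python
import json

# Request body for vLLM's OpenAI-compatible /v1/chat/completions route,
# served by the docker command above on localhost:8000.
payload = {
    "model": "nvidia/Gemma-4-31B-IT-NVFP4",
    "messages": [
        {"role": "user", "content": "Describe this chart in one sentence."}
    ],
    "max_tokens": 128,
}

body = json.dumps(payload)
print(body)
# Send it with, e.g.:
#   curl http://localhost:8000/v1/chat/completions \
#     -H "Content-Type: application/json" -d "$body"
```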
I don't have quantitative results, but outputs on images of complex procedural graphs or scientific figures work, and details (e.g. numbers) read from the images are accurate.
Currently, the included tool-call parser (Docker image 9afe08ebfa30) is not right and may format tool arguments incorrectly.