model hallucinates after number of tokens

#3
by swasslikeme - opened

vLLM recipe (duplicate `--load-format` and `--trust-remote-code` flags removed):

```yaml
# Default settings (can be overridden via CLI)
defaults:
  port: 8355
  host: 0.0.0.0
  tensor_parallel: 1
  gpu_memory_utilization: 0.9
  max_model_len: 196608
  max_num_seqs: 64

# Environment variables
env:
  VLLM_NVFP4_GEMM_BACKEND: "marlin"
  VLLM_TEST_FORCE_FP8_MARLIN: "1"
  VLLM_MARLIN_USE_ATOMIC_ADD: "1"
  VLLM_ALLOW_LONG_MAX_MODEL_LEN: "1"
  VLLM_USE_FLASHINFER_MOE_FP4: "0"

# The vLLM serve command template
command: |
  vllm serve saricles/MiniMax-M2.5-REAP-139B-A10B-NVFP4-GB10 \
    --port {port} \
    --host {host} \
    --served-model-name minimax \
    --gpu-memory-utilization {gpu_memory_utilization} \
    -tp {tensor_parallel} \
    --max-model-len {max_model_len} \
    --max-num-seqs {max_num_seqs} \
    --load-format fastsafetensors \
    --enable-auto-tool-choice \
    --trust-remote-code \
    --attention-backend flashinfer \
    --disable-custom-all-reduce \
    --kv-cache-dtype fp8 \
    --tool-call-parser minimax_m2 \
    --reasoning-parser minimax_m2_append_think
```
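The `{placeholder}` fields in the command template line up with the keys in the `defaults` section. A minimal sketch of how a launcher might render the final command, assuming the placeholders are Python `str.format` fields (the exact templating mechanism of the recipe runner is not shown in this thread):

```python
# Values taken from the recipe's "defaults" section.
defaults = {
    "port": 8355,
    "host": "0.0.0.0",
    "tensor_parallel": 1,
    "gpu_memory_utilization": 0.9,
    "max_model_len": 196608,
    "max_num_seqs": 64,
}

# Abbreviated command template from the recipe (placeholders only).
command_template = (
    "vllm serve saricles/MiniMax-M2.5-REAP-139B-A10B-NVFP4-GB10"
    " --port {port}"
    " --host {host}"
    " --gpu-memory-utilization {gpu_memory_utilization}"
    " -tp {tensor_parallel}"
    " --max-model-len {max_model_len}"
    " --max-num-seqs {max_num_seqs}"
)

# Substitute the defaults into the template to get the concrete command.
command = command_template.format(**defaults)
print(command)
```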

curl command:

```shell
curl -s http://localhost:8355/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"minimax", "messages":[{"role":"user","content":"Write me a roughly 3000 word story."}], "max_tokens":3000}'
```
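The same request can be reproduced from Python with only the stdlib, which makes it easier to script repeated runs. A sketch, assuming the server from the recipe is listening on `localhost:8355` (the request is built but only sent when `send()` is called):

```python
import json
import urllib.request

# Same payload as the curl command above.
payload = {
    "model": "minimax",
    "messages": [
        {"role": "user", "content": "Write me a roughly 3000 word story."}
    ],
    "max_tokens": 3000,
}

request = urllib.request.Request(
    "http://localhost:8355/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)

def send() -> str:
    """Send the request and return the completion text.

    Requires the vLLM server from the recipe to be running locally.
    """
    with urllib.request.urlopen(request) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```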

The model hallucinates after some number of tokens, whereas the upstream REAP 139B (lukealonso/MiniMax-M2.5-REAP-139B-A10B-NVFP4) works. It goes from a reasonable level of coherence to completely incoherent output, eventually just repeating the same word over and over.
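One quick way to quantify the degeneration described above is the fraction of repeated word n-grams in the completion: looping output scores near 1.0, normal prose much lower. A minimal sketch (the n-gram size is an arbitrary choice, and the completion text would come from the chat request above):

```python
from collections import Counter

def repeated_ngram_fraction(text: str, n: int = 4) -> float:
    """Fraction of word n-grams that occur more than once.

    Values near 1.0 indicate degenerate, looping output;
    varied prose stays much lower.
    """
    words = text.split()
    if len(words) < n:
        return 0.0
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    counts = Counter(ngrams)
    repeated = sum(c for c in counts.values() if c > 1)
    return repeated / len(ngrams)

# A completion that loops on one word scores 1.0; varied text scores 0.0.
looping = "the " * 50
varied = "one two three four five six seven eight nine ten"
print(repeated_ngram_fraction(looping))   # → 1.0
print(repeated_ngram_fraction(varied))    # → 0.0
```

Running this over successive chunks of a long generation would show roughly where in the token stream the output collapses.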
