model hallucinates after number of tokens

#3
by swasslikeme - opened

vLLM recipe (duplicate `--load-format` and `--trust-remote-code` flags removed):

```yaml
# Default settings (can be overridden via CLI)
defaults:
  port: 8355
  host: 0.0.0.0
  tensor_parallel: 1
  gpu_memory_utilization: 0.9
  max_model_len: 196608
  max_num_seqs: 64

# Environment variables
env:
  VLLM_NVFP4_GEMM_BACKEND: "marlin"
  VLLM_TEST_FORCE_FP8_MARLIN: "1"
  VLLM_MARLIN_USE_ATOMIC_ADD: "1"
  VLLM_ALLOW_LONG_MAX_MODEL_LEN: "1"
  VLLM_USE_FLASHINFER_MOE_FP4: "0"

# The vLLM serve command template
command: |
  vllm serve saricles/MiniMax-M2.5-REAP-139B-A10B-NVFP4-GB10 \
    --port {port} \
    --host {host} \
    --served-model-name minimax \
    --gpu-memory-utilization {gpu_memory_utilization} \
    -tp {tensor_parallel} \
    --max-model-len {max_model_len} \
    --max-num-seqs {max_num_seqs} \
    --load-format fastsafetensors \
    --enable-auto-tool-choice \
    --trust-remote-code \
    --attention-backend flashinfer \
    --disable-custom-all-reduce \
    --kv-cache-dtype fp8 \
    --tool-call-parser minimax_m2 \
    --reasoning-parser minimax_m2_append_think
```
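The `{placeholder}` fields in the command template line up with the keys in the `defaults` section. A minimal sketch of how a launcher might render the final command, assuming the placeholders are Python `str.format` fields (the exact templating mechanism of the recipe runner is not shown in this thread):

```python
# Values taken from the recipe's "defaults" section.
defaults = {
    "port": 8355,
    "host": "0.0.0.0",
    "tensor_parallel": 1,
    "gpu_memory_utilization": 0.9,
    "max_model_len": 196608,
    "max_num_seqs": 64,
}

# Abbreviated command template from the recipe (placeholders only).
command_template = (
    "vllm serve saricles/MiniMax-M2.5-REAP-139B-A10B-NVFP4-GB10"
    " --port {port}"
    " --host {host}"
    " --gpu-memory-utilization {gpu_memory_utilization}"
    " -tp {tensor_parallel}"
    " --max-model-len {max_model_len}"
    " --max-num-seqs {max_num_seqs}"
)

# Substitute the defaults into the template to get the concrete command.
command = command_template.format(**defaults)
print(command)
```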

curl command:

```shell
curl -s http://localhost:8355/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"minimax", "messages":[{"role":"user","content":"Write me a roughly 3000 word story."}], "max_tokens":3000}'
```
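The same request can be reproduced from Python with only the stdlib, which makes it easier to script repeated runs. A sketch, assuming the server from the recipe is listening on `localhost:8355` (the request is built but only sent when `send()` is called):

```python
import json
import urllib.request

# Same payload as the curl command above.
payload = {
    "model": "minimax",
    "messages": [
        {"role": "user", "content": "Write me a roughly 3000 word story."}
    ],
    "max_tokens": 3000,
}

request = urllib.request.Request(
    "http://localhost:8355/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)

def send() -> str:
    """Send the request and return the completion text.

    Requires the vLLM server from the recipe to be running locally.
    """
    with urllib.request.urlopen(request) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```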

The model hallucinates after some number of tokens, whereas the upstream REAP 139B (lukealonso/MiniMax-M2.5-REAP-139B-A10B-NVFP4) works. It goes from a reasonable level of coherence to completely incoherent output, eventually just repeating the same word over and over.
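One quick way to quantify the degeneration described above is the fraction of repeated word n-grams in the completion: looping output scores near 1.0, normal prose much lower. A minimal sketch (the n-gram size is an arbitrary choice, and the completion text would come from the chat request above):

```python
from collections import Counter

def repeated_ngram_fraction(text: str, n: int = 4) -> float:
    """Fraction of word n-grams that occur more than once.

    Values near 1.0 indicate degenerate, looping output;
    varied prose stays much lower.
    """
    words = text.split()
    if len(words) < n:
        return 0.0
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    counts = Counter(ngrams)
    repeated = sum(c for c in counts.values() if c > 1)
    return repeated / len(ngrams)

# A completion that loops on one word scores 1.0; varied text scores 0.0.
looping = "the " * 50
varied = "one two three four five six seven eight nine ten"
print(repeated_ngram_fraction(looping))   # → 1.0
print(repeated_ngram_fraction(varied))    # → 0.0
```

Running this over successive chunks of a long generation would show roughly where in the token stream the output collapses.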
