vLLM does not load the eagle head
(APIServer pid=1) INFO 03-17 00:09:40 [model.py:533] Resolved architecture: EagleMistralLarge3ForCausalLM
(APIServer pid=1) ERROR 03-17 00:09:40 [repo_utils.py:47] Error retrieving safetensors: 'mistralai/Mistral-Small-4-119B-2603-eagle' is not a safetensors repo. Couldn't find 'model.safetensors.index.json' or 'model.safetensors' files., retrying 1 of 2
(APIServer pid=1) ERROR 03-17 00:09:43 [repo_utils.py:45] Error retrieving safetensors: 'mistralai/Mistral-Small-4-119B-2603-eagle' is not a safetensors repo. Couldn't find 'model.safetensors.index.json' or 'model.safetensors' files.
Currently the weights are named consolidated.safetensors.
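The check behind that error is simple: vLLM looks for the standard safetensors file names and doesn't find them. A minimal offline sketch of what the repo check is doing (the helper name here is hypothetical, not vLLM's actual function):

```python
# Hypothetical illustration of the file check that produces the error above;
# vLLM's repo_utils looks for these two standard file names.
def is_safetensors_repo(files):
    return "model.safetensors.index.json" in files or "model.safetensors" in files

# The eagle repo currently ships consolidated.safetensors, so the check fails:
print(is_safetensors_repo(["consolidated.safetensors", "params.json"]))  # False
```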
Thanks for reporting this. You can ignore the error; I'll track it down in a couple of days and remove it, but vLLM will still load the head. You were able to serve, right?
Yes, the model runs without MTP.
I wanted to try it to compare speed against Nemotron (100 tok/s base, up to 170 tok/s with MTP on 2x RTX Pro 6000).
By the way, do you have KL-divergence stats for the NVFP4 model, or have you tested speculative decoding with it?
If the distribution is too different from the FP8 model, the Eagle head's predictions would be off, right?
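To make that comparison concrete, here's a minimal sketch (pure Python, hypothetical helper names) of how one might estimate per-token KL divergence between the logit vectors the two quantized models produce for the same position:

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def kl_divergence(logits_p, logits_q):
    # KL(P || Q) between the token distributions implied by two logit vectors,
    # e.g. P from the FP8 model and Q from the NVFP4 model at the same position.
    p = softmax(logits_p)
    q = softmax(logits_q)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Identical logits give KL = 0; a shifted distribution gives KL > 0.
fp8_logits = [2.0, 1.0, 0.1]
nvfp4_logits = [1.8, 1.2, 0.0]
print(kl_divergence(fp8_logits, fp8_logits))   # 0.0
print(kl_divergence(fp8_logits, nvfp4_logits))
```

Averaging this over many positions on held-out text would give a rough sense of how far the NVFP4 distribution drifts from FP8.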
I didn't test it myself or compute the KL divergence, but there might be some mismatch, especially for longer sequences. NVFP4 was calibrated after training on selected data, so it can capture parts of the distribution, but I'm unsure how much it recovers compared with the base checkpoint in all cases.
If you do test it, please share your results with the community; it'd be awesome to know!
Tried using this with the NVFP4 Mistral 4 model on a DGX Spark. This is Claude's summarization of the failure:
The error is RuntimeError: unsupported 'a' scalar_type in marlin_gemm during the EAGLE draft model's MLA attention kv_b_proj call. The EAGLE head is inheriting the FP8 quantization from the main model, but the input tensor dtype is incompatible with the Marlin kernel.
Hello. I am successfully running the NVFP4 version on a single DGX Spark (the only trick was to add --no-deps when compiling the patched vLLM; otherwise pip was too eager to replace my carefully selected torch libraries). Prompt processing speed is very good, and, because the Spark has weak memory bandwidth, generation runs at around 10 tok/s. However, I hit a snag enabling speculative decoding (which targets exactly the Spark's weak point). I get the following error:
'EagleMistralLarge3Model' object has no attribute 'aux_hidden_state_layers'
suggesting something is not up-to-date with transformers. But I installed from https://github.com/huggingface/transformers.git as instructed and got version 5.3.0.dev0, which I believe should be the right one. Did I miss something?
Anyway, this model is awesome (following devstral-small-2 which was also awesome).
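For reference, in recent vLLM speculative decoding is enabled through the --speculative-config JSON flag, so the invocation that triggers the error above looks something like this (flag and key names follow the current vLLM docs; the base-model name is a placeholder, since only the eagle repo name appears in the logs, and num_speculative_tokens is an assumption):

```shell
# Sketch based on current vLLM docs; <base-model> is a placeholder and
# num_speculative_tokens is a guess, not a recommended value.
vllm serve <base-model> \
  --speculative-config '{"method": "eagle",
                         "model": "mistralai/Mistral-Small-4-119B-2603-eagle",
                         "num_speculative_tokens": 3}'
```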
Hey @leonbottou, try using nightly vLLM; sadly, the Docker image we shipped is not up to date with respect to EAGLE, as I uncovered a bug at release time.
If the reasoning/tool parser fix is a must-have for you, then sadly you'd have to choose between the two for now. I'm also working with the maintainer team on merging the grammar into vLLM.
Thanks for the quick reply. It basically means "wait one or two weeks until all the patches make it into their respective repositories". The Spark doesn't help here, as it itself needs patched versions of everything. But the result is quite nice: prompt processing easily reaches 1000 tok/s, and generation at 10 tok/s is clearly usable. I expect speculation to raise that to maybe around 25 tok/s. Compared to devstral-small-2 (my previous preferred model; I eat local), both the MoE and speculation have a positive impact on exactly what the Spark does badly. So this new model makes this machine a lot more useful. Thanks again.
