[Resolved] Unable to reproduce evals on AIME'25, AIME'26, HMMT Feb25
Hi Nemo team, thanks for this incredible model and the fully open-sourced data and training recipe. I've been trying to reproduce your evals using nemo-evaluator-launcher, but I'm getting numbers far below those reported:
| Benchmark | reproduced results | reported in Cascade 2 docs |
|---|---|---|
| AIME 2025 with tools (avg@8) | 88.3 | 98.6 |
| AIME 2026 with tools (avg@8) | 90.4 | 95.0 |
| HMMT Feb 2025 with tools (avg@8) | 81.3 | 94.6 |
Software/Hardware:
- GPU: 2xRTX Pro 6000 Blackwell
- Inference engine: SGLang v0.5.9 (latest)
- Evals library: Nemo Evaluator Launcher 0.2.4
Here is my config:

```yaml
defaults:
  - execution: local
  - deployment: none
  - _self_

execution:
  output_dir: nel-results/cascade2_fp8
  mounts:
    evaluation:
      ./hf_cache: /root/.cache/huggingface

target:
  api_endpoint:
    model_id: nvidia/Nemotron-Cascade-2-30B-A3B
    url: http://<my-sglang-endpoint>/v1/chat/completions
    api_key_name: VAST_API_KEY

evaluation:
  env_vars:
    HF_TOKEN: host:HF_TOKEN
    HF_HOME: host:HF_HOME
  nemo_evaluator_config:
    config:
      params:
        parallelism: 16
        max_new_tokens: 131072
        temperature: 1.0
        top_p: 0.95
        request_timeout: 6000
        max_retries: 10
        extra:
          tokenizer_backend: huggingface
          tokenizer: nvidia/Nemotron-Cascade-2-30B-A3B
    target:
      api_endpoint:
        adapter_config:
          params_to_add: {"chat_template_kwargs": {"enable_thinking": true}, "skip_special_tokens": false}
          use_caching: true
          tracking_requests_stats: true
          log_failed_requests: true
          use_request_logging: true
          max_logged_requests: 10
          use_response_logging: true
          max_logged_responses: 10

tasks:
  - name: nemo_skills.ns_aime2025
    nemo_evaluator_config:
      config:
        params:
          extra:
            use_sandbox: true
            num_repeats: 8
            args: "++inference.tokens_to_generate=null ++tool_modules=[nemo_skills.mcp.servers.python_tool::PythonTool]"
  - name: nemo_skills.ns_aime2026
    nemo_evaluator_config:
      config:
        params:
          extra:
            use_sandbox: true
            num_repeats: 8
            args: "++inference.tokens_to_generate=null ++tool_modules=[nemo_skills.mcp.servers.python_tool::PythonTool]"
  - name: nemo_skills.ns_hmmt_feb2025
    nemo_evaluator_config:
      config:
        params:
          extra:
            use_sandbox: true
            num_repeats: 8
            args: "++inference.tokens_to_generate=null ++tool_modules=[nemo_skills.mcp.servers.python_tool::PythonTool]"
```
I execute it like this:

```shell
VAST_API_KEY=<token> HF_TOKEN=<token> HF_HOME="~/.cache/huggingface" \
  nemo-evaluator-launcher run --config eval_cfgs/eval_cascade2_bf16.yaml
```
The SGLang server is launched with the following params:

```shell
python -m sglang.launch_server \
  --model nvidia/Nemotron-Cascade-2-30B-A3B \
  --trust-remote-code \
  --tool-call-parser qwen3_coder \
  --reasoning-parser nano_v3
```
Hi @chankhavu ,
Thanks for your effort!
Here is my Nemo-Skills (https://github.com/NVIDIA-NeMo/Skills) python script to reproduce the AIME25 number:
```python
from nemo_skills.pipeline.cli import eval, wrap_arguments

cluster = "slurm"

eval(
    ctx=wrap_arguments(
        "++inference.tokens_to_generate=131072 "
        "++inference.temperature=1.0 "
        "++inference.top_p=0.95 "
        "++tool_modules=[nemo_skills.mcp.servers.python_tool::PythonTool] "
    ),
    cluster=cluster,
    expname="debug",
    model="nvidia/Nemotron-Cascade-2-30B-A3B",
    server_type="vllm",
    server_container="vllm/vllm-openai:v0.14.1",
    server_gpus=1,
    num_chunks=1,
    with_sandbox=True,
    benchmarks="aime25:8",
    server_args="--mamba_ssm_cache_dtype float32 --no-enable-prefix-caching --enable-auto-tool-choice --tool-call-parser qwen3_coder",
    output_dir="<OUTPUT_DIR>",
)
```
# Results

```
---------------------------------------- aime25 ----------------------------------------
evaluation_mode  | num_entries | avg_tokens | gen_seconds | symbolic_correct | no_answer
pass@1[avg-of-8] | 30          | 12582      | 827         | 98.75% ± 1.73%   | 0.00%
majority@8       | 30          | 12582      | 827         | 100.00%          | 0.00%
pass@8           | 30          | 12582      | 827         | 100.00%          | 0.00%
```
- Are you able to reproduce the numbers with no tool use? This helps ablate a tool-use issue.
- Can you try a vLLM server? This helps ablate a server issue.
Thanks.
Thanks for your quick response, @ychenNLP! Indeed, switching to vLLM with your exact parameters works. Here are my results on AIME'25, using nemo-evaluator-launcher with the same YAML config as in my post above:
| evaluation_mode | num_entries | avg_tokens | gen_seconds | symbolic_correct | no_answer |
|---|---|---|---|---|---|
| pass@1[avg-of-8] | 30 | 11494 | 3330 | 99.17% ± 1.54% | 0.00% |
| majority@8 | 30 | 11494 | 3330 | 100.00% | 0.00% |
| pass@8 | 30 | 11494 | 3330 | 100.00% | 0.00% |
Differences with the SGLang / default command from the Nemotron-3-Nano vLLM cookbook:

- Added `--mamba_ssm_cache_dtype float32`: this might be the main reason; I will ablate this parameter when I have time later this evening
- Removed `--reasoning-parser nemotron_v3`: I don't think this has anything to do with the perf increase
My full vLLM command:

```shell
vllm serve nvidia/Nemotron-Cascade-2-30B-A3B \
  --max-model-len 262144 \
  --trust-remote-code \
  --mamba_ssm_cache_dtype float32 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder
```
Great to hear!
For vLLM, `--mamba_ssm_cache_dtype float32` is a crucial config for this model.
For SGLang, `--mamba-ssm-dtype float32` might be important.
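If that helps, the SGLang launch command from the original post would become something like the sketch below. Note the flag name `--mamba-ssm-dtype` is an assumption based on the note above, not verified against this SGLang version; please check `python -m sglang.launch_server --help` first:

```shell
python -m sglang.launch_server \
  --model nvidia/Nemotron-Cascade-2-30B-A3B \
  --trust-remote-code \
  --tool-call-parser qwen3_coder \
  --reasoning-parser nano_v3 \
  --mamba-ssm-dtype float32
```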
Thanks a lot, @ychenNLP !
I was able to confirm that the selective quantization recipe for the Nano 30B (from the Nemotron 3 Nano Technical Report) works perfectly for Cascade 2 as well:
| Benchmark | BF16 (reproduced) | FP8 | NVFP4 |
|---|---|---|---|
| AIME 2025 (avg@8) | 98.8 | 96.7 | 97.9 |
| AIME 2026 (avg@8) | 94.2 | 95.0 | 92.1 |
| HMMT Feb 2025 (avg@8) | 92.9 | 93.8 | 90.1 |
With 8 rollouts per problem, a ±2% deviation across runs is expected. FP8 is on par with BF16, while NVFP4 is consistently 1-2% below BF16.
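For context on why a couple of percent of run-to-run noise is expected at avg@8 on a 30-problem benchmark: a single flipped answer moves a run's score by 1/30 ≈ 3.3 points. Here is a minimal sketch of a pass@1[avg-of-k]-style score and its spread (not the nemo-skills implementation; the function name and toy data are mine):

```python
import statistics

def avg_at_k(rollouts):
    """rollouts: k lists of per-problem booleans (one list per rollout).

    Returns (mean accuracy in %, sample std across rollouts in %):
    the pass@1[avg-of-k]-style score and its run-to-run spread.
    """
    per_run = [100.0 * sum(run) / len(run) for run in rollouts]
    return statistics.mean(per_run), statistics.stdev(per_run)

# Toy example: 2 rollouts over a 30-problem benchmark.
run_a = [True] * 29 + [False]  # 29/30 correct -> 96.67%
run_b = [True] * 30            # 30/30 correct -> 100.00%
mean, spread = avg_at_k([run_a, run_b])
```

With one rollout differing by a single problem, the spread already exceeds 2 points, so ±2% across full avg@8 runs is unsurprising.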
@chankhavu Thanks a lot for validating this and for sharing the follow-up results. It looks like the problem is resolved.
When you have a moment, could you please update the issue title accordingly and close it? Really appreciate it.
@chankhavu In case you want to reproduce the no-tool-use setting for IMO-AnswerBench:
https://huggingface.co/nvidia/Nemotron-Cascade-2-30B-A3B/discussions/24