[Resolved] Unable to reproduce evals on AIME'25, AIME'26, HMMT Feb25

#8
by chankhavu - opened

Hi Nemo team, thanks for this incredible model and fully open-sourced data and training recipe. I've been trying to reproduce your evals using nemo-evaluator-launcher, but I'm getting numbers far below the reported ones:

Benchmark                        | reproduced | reported in Cascade 2 docs
AIME 2025 with tools (avg@8)     | 88.3       | 98.6
AIME 2026 with tools (avg@8)     | 90.4       | 95.0
HMMT Feb 2025 with tools (avg@8) | 81.3       | 94.6

Software/Hardware:

  • GPU: 2xRTX Pro 6000 Blackwell
  • Inference engine: SGLang v0.5.9 (latest)
  • Evals library: Nemo Evaluator Launcher 0.2.4

Here is my config:

defaults:
  - execution: local
  - deployment: none
  - _self_

execution:
  output_dir: nel-results/cascade2_fp8
  mounts:
    evaluation:
      ./hf_cache: /root/.cache/huggingface
target:
  api_endpoint:
    model_id: nvidia/Nemotron-Cascade-2-30B-A3B
    url: http://<my-sglang-endpoint>/v1/chat/completions
    api_key_name: VAST_API_KEY

evaluation:
  env_vars:
    HF_TOKEN: host:HF_TOKEN
    HF_HOME: host:HF_HOME
  nemo_evaluator_config:
    config:
      params:
        parallelism: 16
        max_new_tokens: 131072
        temperature: 1.0
        top_p: 0.95
        request_timeout: 6000
        max_retries: 10
        extra:
          tokenizer_backend: huggingface
          tokenizer: nvidia/Nemotron-Cascade-2-30B-A3B
    target:
      api_endpoint:
        adapter_config:
          params_to_add: {"chat_template_kwargs": {"enable_thinking": true}, "skip_special_tokens": false}
          use_caching: true
          tracking_requests_stats: true
          log_failed_requests: true
          use_request_logging: true
          max_logged_requests: 10
          use_response_logging: true
          max_logged_responses: 10

  tasks:
  - name: nemo_skills.ns_aime2025
    nemo_evaluator_config:
      config:
        params:
          extra:
            use_sandbox: true
            num_repeats: 8
            args: "++inference.tokens_to_generate=null ++tool_modules=[nemo_skills.mcp.servers.python_tool::PythonTool]"
  - name: nemo_skills.ns_aime2026
    nemo_evaluator_config:
      config:
        params:
          extra:
            use_sandbox: true
            num_repeats: 8
            args: "++inference.tokens_to_generate=null ++tool_modules=[nemo_skills.mcp.servers.python_tool::PythonTool]"
  - name: nemo_skills.ns_hmmt_feb2025
    nemo_evaluator_config:
      config:
        params:
          extra:
            use_sandbox: true
            num_repeats: 8
            args: "++inference.tokens_to_generate=null ++tool_modules=[nemo_skills.mcp.servers.python_tool::PythonTool]"

I execute it like this:

VAST_API_KEY=<token> HF_TOKEN=<token> HF_HOME="$HOME/.cache/huggingface" nemo-evaluator-launcher run --config eval_cfgs/eval_cascade2_bf16.yaml

The SGLang server is launched with the following params:

python -m sglang.launch_server \
    --model nvidia/Nemotron-Cascade-2-30B-A3B \
    --trust-remote-code \
    --tool-call-parser qwen3_coder \
    --reasoning-parser nano_v3

Hi @chankhavu ,

Thanks for your effort!
Here is my Nemo-Skills (https://github.com/NVIDIA-NeMo/Skills) Python script to reproduce the AIME25 number:

from nemo_skills.pipeline.cli import eval, wrap_arguments

cluster = "slurm"

eval(
    ctx=wrap_arguments(
        "++inference.tokens_to_generate=131072 "
        "++inference.temperature=1.0 "
        "++inference.top_p=0.95 "
        "++tool_modules=[nemo_skills.mcp.servers.python_tool::PythonTool] "
    ),
    cluster=cluster,
    expname="debug",
    model="nvidia/Nemotron-Cascade-2-30B-A3B",
    server_type='vllm',
    server_container='vllm/vllm-openai:v0.14.1',
    server_gpus=1,
    num_chunks=1,
    with_sandbox=True,
    benchmarks="aime25:8",
    server_args="--mamba_ssm_cache_dtype float32 --no-enable-prefix-caching --enable-auto-tool-choice --tool-call-parser qwen3_coder",
    output_dir="<OUTPUT_DIR>"
)


# Results
---------------------------------------- aime25 ----------------------------------------
evaluation_mode  | num_entries | avg_tokens | gen_seconds | symbolic_correct | no_answer
pass@1[avg-of-8] | 30          | 12582      | 827         | 98.75% ± 1.73%   | 0.00%    
majority@8       | 30          | 12582      | 827         | 100.00%          | 0.00%    
pass@8           | 30          | 12582      | 827         | 100.00%          | 0.00%    
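For readers following along, the three metrics above can be computed from the per-rollout answers roughly as follows (a minimal sketch of the standard definitions, not the actual Nemo-Skills grading code; the example answers are made up):

```python
from collections import Counter

def eval_metrics(answers, correct_answer):
    """Score k rollouts on one problem: (avg@k, majority@k, pass@k)."""
    correct = [a == correct_answer for a in answers]
    avg_at_k = sum(correct) / len(correct)            # pass@1[avg-of-k]
    majority_answer = Counter(answers).most_common(1)[0][0]
    majority_at_k = majority_answer == correct_answer  # majority@k
    pass_at_k = any(correct)                           # pass@k
    return avg_at_k, majority_at_k, pass_at_k

# Toy example: 8 rollouts on one problem, 7 of which are correct.
rollouts = ["70", "70", "70", "70", "70", "70", "70", "28"]
print(eval_metrics(rollouts, "70"))  # (0.875, True, True)
```

The benchmark-level numbers are then averaged over all problems (30 for AIME).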
  1. Are you able to reproduce the numbers with no tool use? This helps ablate the tool-use issue.
  2. Can you try the vLLM server? This helps ablate the server issue.

Thanks.

Thanks for your quick response, @ychenNLP ! Indeed, switching to vLLM with your exact parameters works. Here are my results on AIME'25, using nemo-evaluator-launcher with the same YAML config as in my post above:

evaluation_mode  | num_entries | avg_tokens | gen_seconds | symbolic_correct | no_answer
pass@1[avg-of-8] | 30          | 11494      | 3330        | 99.17% ± 1.54%   | 0.00%
majority@8       | 30          | 11494      | 3330        | 100.00%          | 0.00%
pass@8           | 30          | 11494      | 3330        | 100.00%          | 0.00%

Differences with SGLang / default command from Nemotron-3-Nano vLLM cookbook:

  • Added --mamba_ssm_cache_dtype float32 -- this might be the main reason; I will ablate this parameter when I have time later this evening
  • Removed --reasoning-parser nemotron_v3 -- I don't think this has anything to do with the accuracy change

My full vLLM command:

vllm serve nvidia/Nemotron-Cascade-2-30B-A3B \
  --max-model-len 262144 \
  --trust-remote-code \
  --mamba_ssm_cache_dtype float32 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder

Great to hear!
For vLLM, --mamba_ssm_cache_dtype float32 is a crucial setting for this model.
For SGLang, --mamba-ssm-dtype float32 might be important in the same way.
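To see why the state-cache dtype matters so much for a state-space model, here is a toy illustration (not the actual Mamba kernel, and NumPy has no bfloat16, so float16 stands in for the low-precision cache): a long linear recurrence accumulates rounding error when its carried state is kept in low precision.

```python
import numpy as np

def ssm_like_scan(x, decay, dtype):
    """Toy linear recurrence h_t = decay * h_{t-1} + x_t, state kept in `dtype`."""
    h = np.zeros((), dtype=dtype)
    for x_t in x:
        # Round the carried state back to `dtype` at every step,
        # mimicking a recurrent cache stored in that precision.
        h = (dtype(decay) * h + dtype(x_t)).astype(dtype)
    return float(h)

rng = np.random.default_rng(0)
x = rng.standard_normal(50_000)

ref = ssm_like_scan(x, 0.999, np.float64)  # high-precision reference
lo = ssm_like_scan(x, 0.999, np.float16)   # low-precision stand-in for a bf16 cache
print(f"drift of low-precision state: {abs(lo - ref):.3f}")
```

Unlike a transformer's KV cache, the SSM state is re-used and overwritten at every step, so per-step rounding compounds over the whole sequence; keeping that one buffer in float32 avoids the drift at modest memory cost.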

Thanks a lot, @ychenNLP !

I was able to confirm that the selective quantization recipe of Nano 30b (from the Nemotron 3 Nano Technical Report) works perfectly for Cascade 2 as well:

Benchmark             | BF16 (reproduced) | FP8  | NVFP4
AIME 2025 (avg@8)     | 98.8              | 96.7 | 97.9
AIME 2026 (avg@8)     | 94.2              | 95.0 | 92.1
HMMT Feb 2025 (avg@8) | 92.9              | 93.8 | 90.1

With 8 rollouts per problem, a ±2% deviation across runs is expected. FP8 is on par with BF16, while NVFP4 is consistently 1-2% below BF16.
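The ±2% figure is consistent with a simple binomial estimate: with 30 problems × 8 rollouts and per-rollout accuracy around 95%, the standard error of the run-level score is roughly 1.4% (treating all 240 graded samples as independent, which if anything understates the variance, since rollouts of the same problem are correlated):

```python
import math

problems, rollouts, p = 30, 8, 0.95  # AIME-sized benchmark, avg@8, ~95% accuracy
n = problems * rollouts              # 240 graded samples per run
se = math.sqrt(p * (1 - p) / n)      # binomial standard error of the mean
print(f"standard error ~ {se:.1%}")  # ~1.4%, so ±2% run-to-run swings are normal
```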

@chankhavu Thanks a lot for validating this and for sharing the follow-up results. It looks like the problem is resolved.
When you have a moment, could you please update the issue title accordingly and close it? Really appreciate it.

chankhavu changed discussion title from Unable to reproduce evals on AIME'25, AIME'26, HMMT Feb25 to [Resolved] Unable to reproduce evals on AIME'25, AIME'26, HMMT Feb25
chankhavu changed discussion status to closed

@chankhavu In case you want to reproduce the no tool use setting for IMO-AnswerBench:
https://huggingface.co/nvidia/Nemotron-Cascade-2-30B-A3B/discussions/24
