[Resolved] Unable to reproduce evals on AIME'25, AIME'26, HMMT Feb25

#8
by chankhavu - opened

Hi Nemo team, thanks for this incredible model and fully open-sourced data and training recipe. I've been trying to reproduce your evals using nemo-evaluator-launcher, but I'm getting numbers far below the reported ones:

Benchmark                        | reproduced | reported in Cascade 2 docs
AIME 2025 with tools (avg@8)     | 88.3       | 98.6
AIME 2026 with tools (avg@8)     | 90.4       | 95.0
HMMT Feb 2025 with tools (avg@8) | 81.3       | 94.6

Software/Hardware:

  • GPU: 2xRTX Pro 6000 Blackwell
  • Inference engine: SGLang v0.5.9 (latest)
  • Evals library: Nemo Evaluator Launcher 0.2.4

Here is my config:

defaults:
  - execution: local
  - deployment: none
  - _self_

execution:
  output_dir: nel-results/cascade2_fp8
  mounts:
    evaluation:
      ./hf_cache: /root/.cache/huggingface
target:
  api_endpoint:
    model_id: nvidia/Nemotron-Cascade-2-30B-A3B
    url: http://<my-sglang-endpoint>/v1/chat/completions
    api_key_name: VAST_API_KEY

evaluation:
  env_vars:
    HF_TOKEN: host:HF_TOKEN
    HF_HOME: host:HF_HOME
  nemo_evaluator_config:
    config:
      params:
        parallelism: 16
        max_new_tokens: 131072
        temperature: 1.0
        top_p: 0.95
        request_timeout: 6000
        max_retries: 10
        extra:
          tokenizer_backend: huggingface
          tokenizer: nvidia/Nemotron-Cascade-2-30B-A3B
    target:
      api_endpoint:
        adapter_config:
          params_to_add: {"chat_template_kwargs": {"enable_thinking": true}, "skip_special_tokens": false}
          use_caching: true
          tracking_requests_stats: true
          log_failed_requests: true
          use_request_logging: true
          max_logged_requests: 10
          use_response_logging: true
          max_logged_responses: 10

  tasks:
  - name: nemo_skills.ns_aime2025
    nemo_evaluator_config:
      config:
        params:
          extra:
            use_sandbox: true
            num_repeats: 8
            args: "++inference.tokens_to_generate=null ++tool_modules=[nemo_skills.mcp.servers.python_tool::PythonTool]"
  - name: nemo_skills.ns_aime2026
    nemo_evaluator_config:
      config:
        params:
          extra:
            use_sandbox: true
            num_repeats: 8
            args: "++inference.tokens_to_generate=null ++tool_modules=[nemo_skills.mcp.servers.python_tool::PythonTool]"
  - name: nemo_skills.ns_hmmt_feb2025
    nemo_evaluator_config:
      config:
        params:
          extra:
            use_sandbox: true
            num_repeats: 8
            args: "++inference.tokens_to_generate=null ++tool_modules=[nemo_skills.mcp.servers.python_tool::PythonTool]"

I execute it like this:

VAST_API_KEY=<token> HF_TOKEN=<token> HF_HOME="$HOME/.cache/huggingface" nemo-evaluator-launcher run --config eval_cfgs/eval_cascade2_bf16.yaml

The SGLang server is launched with the following params:

python -m sglang.launch_server \
    --model nvidia/Nemotron-Cascade-2-30B-A3B \
    --trust-remote-code \
    --tool-call-parser qwen3_coder \
    --reasoning-parser nano_v3

Hi @chankhavu ,

Thanks for your effort!
Here is my Nemo-Skills (https://github.com/NVIDIA-NeMo/Skills) Python script to reproduce the AIME25 number:

from nemo_skills.pipeline.cli import eval, wrap_arguments

cluster = "slurm"

eval(
    ctx=wrap_arguments(
        "++inference.tokens_to_generate=131072 "
        "++inference.temperature=1.0 "
        "++inference.top_p=0.95 "
        "++tool_modules=[nemo_skills.mcp.servers.python_tool::PythonTool] "
    ),
    cluster=cluster,
    expname="debug",
    model="nvidia/Nemotron-Cascade-2-30B-A3B",
    server_type='vllm',
    server_container='vllm/vllm-openai:v0.14.1',
    server_gpus=1,
    num_chunks=1,
    with_sandbox=True,
    benchmarks="aime25:8",
    server_args="--mamba_ssm_cache_dtype float32 --no-enable-prefix-caching --enable-auto-tool-choice --tool-call-parser qwen3_coder",
    output_dir="<OUTPUT_DIR>"
)


# Results
---------------------------------------- aime25 ----------------------------------------
evaluation_mode  | num_entries | avg_tokens | gen_seconds | symbolic_correct | no_answer
pass@1[avg-of-8] | 30          | 12582      | 827         | 98.75% ± 1.73%   | 0.00%    
majority@8       | 30          | 12582      | 827         | 100.00%          | 0.00%    
pass@8           | 30          | 12582      | 827         | 100.00%          | 0.00%    
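For readers following along, the three metrics above can be computed from the per-rollout answers roughly as follows (a minimal sketch of the standard definitions, not the actual Nemo-Skills grading code; the example answers are made up):

```python
from collections import Counter

def eval_metrics(answers, correct_answer):
    """Score k rollouts on one problem: (avg@k, majority@k, pass@k)."""
    correct = [a == correct_answer for a in answers]
    avg_at_k = sum(correct) / len(correct)            # pass@1[avg-of-k]
    majority_answer = Counter(answers).most_common(1)[0][0]
    majority_at_k = majority_answer == correct_answer  # majority@k
    pass_at_k = any(correct)                           # pass@k
    return avg_at_k, majority_at_k, pass_at_k

# Toy example: 8 rollouts on one problem, 7 of which are correct.
rollouts = ["70", "70", "70", "70", "70", "70", "70", "28"]
print(eval_metrics(rollouts, "70"))  # (0.875, True, True)
```

The benchmark-level numbers are then averaged over all problems (30 for AIME).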
  1. Are you able to reproduce the numbers with no tool use? This helps ablate the tool-use issue.
  2. Can you try the vLLM server? This helps ablate the server issue.

Thanks.

Thanks for your quick response, @ychenNLP ! Indeed, switching to vLLM with your exact parameters works. Here are my results on AIME'25, using nemo-evaluator-launcher with the same YAML config as in my post above:

evaluation_mode  | num_entries | avg_tokens | gen_seconds | symbolic_correct | no_answer
pass@1[avg-of-8] | 30          | 11494      | 3330        | 99.17% ± 1.54%   | 0.00%
majority@8       | 30          | 11494      | 3330        | 100.00%          | 0.00%
pass@8           | 30          | 11494      | 3330        | 100.00%          | 0.00%

Differences with SGLang / default command from Nemotron-3-Nano vLLM cookbook:

  • Added --mamba_ssm_cache_dtype float32 -- this might be the main reason; I will ablate this parameter when I have time later this evening
  • Removed --reasoning-parser nemotron_v3 -- I don't think this has anything to do with the accuracy change

My full vLLM command:

vllm serve nvidia/Nemotron-Cascade-2-30B-A3B \
  --max-model-len 262144 \
  --trust-remote-code \
  --mamba_ssm_cache_dtype float32 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder

Great to hear!
For vLLM, --mamba_ssm_cache_dtype float32 is a crucial setting for this model.
For SGLang, --mamba-ssm-dtype float32 might be important in the same way.
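To see why the state-cache dtype matters so much for a state-space model, here is a toy illustration (not the actual Mamba kernel, and NumPy has no bfloat16, so float16 stands in for the low-precision cache): a long linear recurrence accumulates rounding error when its carried state is kept in low precision.

```python
import numpy as np

def ssm_like_scan(x, decay, dtype):
    """Toy linear recurrence h_t = decay * h_{t-1} + x_t, state kept in `dtype`."""
    h = np.zeros((), dtype=dtype)
    for x_t in x:
        # Round the carried state back to `dtype` at every step,
        # mimicking a recurrent cache stored in that precision.
        h = (dtype(decay) * h + dtype(x_t)).astype(dtype)
    return float(h)

rng = np.random.default_rng(0)
x = rng.standard_normal(50_000)

ref = ssm_like_scan(x, 0.999, np.float64)  # high-precision reference
lo = ssm_like_scan(x, 0.999, np.float16)   # low-precision stand-in for a bf16 cache
print(f"drift of low-precision state: {abs(lo - ref):.3f}")
```

Unlike a transformer's KV cache, the SSM state is re-used and overwritten at every step, so per-step rounding compounds over the whole sequence; keeping that one buffer in float32 avoids the drift at modest memory cost.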

Thanks a lot, @ychenNLP !

I was able to confirm that the selective quantization recipe of Nano 30b (from the Nemotron 3 Nano Technical Report) works perfectly for Cascade 2 as well:

Benchmark             | BF16 (reproduced) | FP8  | NVFP4
AIME 2025 (avg@8)     | 98.8              | 96.7 | 97.9
AIME 2026 (avg@8)     | 94.2              | 95.0 | 92.1
HMMT Feb 2025 (avg@8) | 92.9              | 93.8 | 90.1

With 8 rollouts per problem, a ±2% deviation across runs is expected. FP8 is on par with BF16, while NVFP4 is consistently 1-2% below BF16.
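The ±2% figure is consistent with a simple binomial estimate: with 30 problems × 8 rollouts and per-rollout accuracy around 95%, the standard error of the run-level score is roughly 1.4% (treating all 240 graded samples as independent, which if anything understates the variance, since rollouts of the same problem are correlated):

```python
import math

problems, rollouts, p = 30, 8, 0.95  # AIME-sized benchmark, avg@8, ~95% accuracy
n = problems * rollouts              # 240 graded samples per run
se = math.sqrt(p * (1 - p) / n)      # binomial standard error of the mean
print(f"standard error ~ {se:.1%}")  # ~1.4%, so ±2% run-to-run swings are normal
```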

@chankhavu Thanks a lot for validating this and for sharing the follow-up results. It looks like the problem is resolved.
When you have a moment, could you please update the issue title accordingly and close it? Really appreciate it.

chankhavu changed discussion title from Unable to reproduce evals on AIME'25, AIME'26, HMMT Feb25 to [Resolved] Unable to reproduce evals on AIME'25, AIME'26, HMMT Feb25
chankhavu changed discussion status to closed

@chankhavu In case you want to reproduce the no tool use setting for IMO-AnswerBench:
https://huggingface.co/nvidia/Nemotron-Cascade-2-30B-A3B/discussions/24
