187 tok/s on RTX 3090, 625K Context, Agent Coding (IQ4_XS + Hermes Agent)

#20
by ychenNLP - opened

Sharing a practical deployment setup spotted on Twitter that demonstrates impressive real-world performance of Nemotron Cascade 2 30B-A3B. Credit to @sudoingX for the original benchmarks and testing.

What Happened

@sudoingX pointed Hermes Agent at this model running on a single RTX 3090 (IQ4_XS quant), had it discover its own hardware, create an identity file, and then build a full GPU marketplace UI, all from a single prompt on the first attempt, with no iteration. For comparison, Qwen 3.5 35B-A3B on the same hardware and the same type of task needed an extra iteration to recover from a blank screen.

Setup

  • GPU: Single RTX 3090 24GB
  • Quantization: IQ4_XS (by bartowski)
  • Engine: llama.cpp (compiled from source)
  • Flags: -ngl 99 -np 1 — no context flags, no KV cache tricks
  • Agent Harness: Hermes Agent (open source, by NousResearch)
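For reference, the flags above map onto a llama.cpp server invocation roughly like this (the model filename and everything other than `-ngl 99 -np 1` are placeholders, not from the original post):

```shell
# Sketch of the reported setup: all layers offloaded to the GPU (-ngl 99),
# a single parallel slot (-np 1), and no explicit context or KV-cache flags.
./build/bin/llama-server \
  -m Nemotron-Cascade-2-30B-A3B-IQ4_XS.gguf \
  -ngl 99 \
  -np 1
```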

Results

  • 187 tok/s, flat from 4K to 625K context, zero speed loss
  • Flash attention auto-enables, context auto-allocates to 625K — minimal config needed
  • One-shotted a full GPU marketplace UI from a single prompt, no iteration
  • Tool calling works reliably when paired with a compatible harness (Hermes Agent has per-model parsers for native tool call handling)

vs Qwen 3.5 35B-A3B on the Same Hardware

Same RTX 3090, same active parameters, 24 days apart in release:

|                       | Nemotron Cascade 2 | Qwen 3.5 35B-A3B                              |
| --------------------- | ------------------ | --------------------------------------------- |
| tok/s                 | 187                | 112                                           |
| Max context           | 625K               | 262K                                          |
| Extra flags needed    | None               | KV cache quantization                         |
| One-shot agent coding | Yes                | Needed iteration to recover from blank screen |

~67% faster generation with fewer flags on identical hardware.
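The speedup figure checks out against the table; a one-liner to reproduce the arithmetic:

```shell
# Relative generation speedup: 187 tok/s vs 112 tok/s.
awk 'BEGIN { printf "%.0f%%\n", (187 / 112 - 1) * 100 }'
# prints: 67%
```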

A Note on Harness Compatibility

The original tester noted that this model may underperform with generic harnesses (e.g. tool call parsing issues). Pairing with Hermes Agent resolved these issues — same model, same hardware, significantly better results. Worth keeping in mind when evaluating.

Try It and Share Your Results!

If you've run Cascade 2 on any hardware, please drop your numbers and experience below:

  • GPU
  • Quant & Engine
  • tok/s and max context tested
  • Use case (chat, agent/tool-calling, coding, etc.)
  • Screenshots or demos if you have them!

References

[Screenshot of the original benchmark post, 2026-03-27]


88 t/s with MLX 4-bit on an M4 Pro (48GB), but the output is all junk.

105–109 t/s with IQ4_XS on an R9700, same flags.


122 t/s with IQ4_XS on a 7900 XTX.


Here are my results; let me know what you think. This is my first time putting a lot of effort into quantizing a model. I've been doing llama.cpp NVFP4 work for a very long time and have lots to share.

Model: michaelw9999/Nemotron-Cascade-2-30B-A3B-NVFP4-GGUF
Size: 20.7 GB - Moved some tensors from BF16 to Q8 to cut down on size
Hardware: SM120 RTX 5090 32GB
Platform/Kernel: My PR on llama.cpp #21074 (Native Blackwell MMA version coming soon)
Quantizer: My llama-quantizer using imatrix, still a work in progress, would love feedback if this model works well for anyone or if I should make changes
Eval benchmarks aren't completing correctly for me yet on the default configuration; it's my first time trying to set those up. The answers were correct in the reasoning output, but it would cut off before finishing. I still need to figure out that issue.

prompt eval time =    5496.39 ms / 40782 tokens (    0.13 ms per token,  7419.78 tokens per second)
       eval time =    5471.31 ms /  1106 tokens (    4.95 ms per token,   202.15 tokens per second)
      total time =   10967.70 ms / 41888 tokens

prompt eval time =      72.94 ms /   220 tokens (    0.33 ms per token,  3016.38 tokens per second)
       eval time =     135.60 ms /    30 tokens (    4.52 ms per token,   221.24 tokens per second)
      total time =     208.54 ms /   250 tokens
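As a sanity check, the reported rates follow directly from the timing lines above (tokens divided by eval time in seconds), e.g. for the first run:

```shell
# 1106 generated tokens in 5471.31 ms of eval time -> tokens per second.
awk 'BEGIN { printf "%.2f tok/s\n", 1106 / (5471.31 / 1000) }'
# prints: 202.15 tok/s
```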

With llama-bench:

ggml_cuda_init: found 1 CUDA devices (Total VRAM: 32606 MiB):
  Device 0: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes, VRAM: 32606 MiB
| model                          |       size |     params | backend    | ngl | n_batch | n_ubatch | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | -------: | -: | --------------: | -------------------: |
| nemotron_h_moe 31B.A3.5B NVFP4 |  19.28 GiB |    31.58 B | CUDA       |  99 |     128 |      128 |  0 |           pp256 |      2914.76 ± 10.52 |
| nemotron_h_moe 31B.A3.5B NVFP4 |  19.28 GiB |    31.58 B | CUDA       |  99 |     128 |      128 |  1 |           tg128 |        229.52 ± 0.29 |
| nemotron_h_moe 31B.A3.5B NVFP4 |  19.28 GiB |    31.58 B | CUDA       |  99 |     128 |      128 |  1 |  pp2048 @ d2048 |      2881.25 ± 15.60 |
| nemotron_h_moe 31B.A3.5B NVFP4 |  19.28 GiB |    31.58 B | CUDA       |  99 |     128 |      128 |  1 |   tg128 @ d2048 |        231.26 ± 1.66 |
| nemotron_h_moe 31B.A3.5B NVFP4 |  19.28 GiB |    31.58 B | CUDA       |  99 |     128 |      128 |  1 |   pp512 @ d8192 |      2856.47 ± 14.79 |
| nemotron_h_moe 31B.A3.5B NVFP4 |  19.28 GiB |    31.58 B | CUDA       |  99 |     128 |      256 |  1 |   tg128 @ d8192 |        224.18 ± 1.67 |
| nemotron_h_moe 31B.A3.5B NVFP4 |  19.28 GiB |    31.58 B | CUDA       |  99 |     128 |      128 |  1 | pp1024 @ d16384 |      2818.49 ± 23.84 |
| nemotron_h_moe 31B.A3.5B NVFP4 |  19.28 GiB |    31.58 B | CUDA       |  99 |     128 |      256 |  1 |  tg128 @ d16384 |        221.46 ± 1.24 |
:~/llama.cpp-nvfp4-mmq-mma(nvfp4-mmq-mma)$ ./build/bin/llama-bench -m /mnt/c/Nemotron-Cascade-2-30B-A3B-NVFP4-GGUF.gguf 
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 32606 MiB):
  Device 0: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes, VRAM: 32606 MiB
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| nemotron_h_moe 31B.A3.5B NVFP4 |  19.28 GiB |    31.58 B | CUDA       |  99 |           pp512 |      7846.34 ± 41.91 |
| nemotron_h_moe 31B.A3.5B NVFP4 |  19.28 GiB |    31.58 B | CUDA       |  99 |           tg128 |        226.91 ± 1.13 |

build: f63b6f6ca (8556)
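For anyone wanting to reproduce these tables, the runs appear to correspond to llama-bench invocations along these lines (the model path is a placeholder, and the exact `-fa`/`-ub` mix varies per row in the first table):

```shell
# First table: small batches, flash attention on, at several context depths.
./build/bin/llama-bench -m Nemotron-Cascade-2-30B-A3B-NVFP4-GGUF.gguf \
  -b 128 -ub 128 -fa 1 \
  -p 512 -n 128 -d 2048,8192,16384

# Second table: llama-bench defaults (pp512 / tg128).
./build/bin/llama-bench -m Nemotron-Cascade-2-30B-A3B-NVFP4-GGUF.gguf
```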

Great contributions, thanks for sharing! Compiling the numbers so far into one table:

| User          | GPU            | Quant    | Engine                       | tok/s (tg) | Notes                                                |
| ------------- | -------------- | -------- | ---------------------------- | ---------- | ---------------------------------------------------- |
| @sudoingX     | RTX 3090       | IQ4_XS   | llama.cpp (source)           | 187        | 625K context, flat speed, one-shot agent coding      |
| @djdeniro     | RX 7900 XTX    | IQ4_XS   | llama.cpp                    | 122        |                                                      |
| @djdeniro     | R9700          | IQ4_XS   | llama.cpp                    | 105–109    |                                                      |
| @djdeniro     | M4 Pro 48GB    | MLX-4BIT | MLX                          | 88         | Output quality issues; may be a quant/engine problem |
| @michaelw9999 | RTX 5090       | NVFP4    | llama.cpp (custom PR #21074) | ~221–230   | Native Blackwell MMA coming soon; pp512 at 7846 t/s  |

Not through llama-bench, but here's what I'm seeing from Hermes:

GPU: RTX PRO 6000 Blackwell Max-Q
Quant: BF16
Engine: vLLM
tok/s: 182
pp: 7–14K tok/s

nemotron-cascade-2-30b-a3b_02  | (APIServer pid=1) INFO 03-28 23:09:34 [loggers.py:259] Engine 000: Avg prompt throughput: 5075.5 tokens/s, Avg generation throughput: 116.4 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.6%, Prefix cache hit rate: 0.0%
nemotron-cascade-2-30b-a3b_02  | (APIServer pid=1) INFO 03-28 23:09:44 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 179.7 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.7%, Prefix cache hit rate: 0.0%
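To track the steady-state generation number from vLLM's periodic logger lines, a quick extraction sketch (assuming the log format shown above, with `vllm.log` as a placeholder filename):

```shell
# Pull the "Avg generation throughput" values out of vLLM APIServer logs.
grep -oE 'Avg generation throughput: [0-9.]+' vllm.log \
  | awk '{ print $4 }'
```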

With 4x 7900 XTX and the original unquantized model on vLLM, I got 88 t/s generation throughput and 3–5K avg prompt throughput (no cache).
