187 tok/s on RTX 3090, 625K Context, Agent Coding (IQ4_XS + Hermes Agent)

#20
by ychenNLP - opened

Sharing a practical deployment setup spotted on Twitter that demonstrates impressive real-world performance of Nemotron Cascade 2 30B-A3B. Credit to @sudoingX for the original benchmarks and testing.

What Happened

@sudoingX pointed Hermes Agent at this model running on a single RTX 3090 (IQ4_XS quant), had it discover its own hardware, create an identity file, and then build a full GPU marketplace UI, all from a single prompt on the first attempt, with no iteration. For comparison, Qwen 3.5 35B-A3B on the same hardware and the same type of task needed an extra iteration to recover from a blank screen.

Setup

  • GPU: Single RTX 3090 24GB
  • Quantization: IQ4_XS (by bartowski)
  • Engine: llama.cpp (compiled from source)
  • Flags: -ngl 99 -np 1 — no context flags, no KV cache tricks
  • Agent Harness: Hermes Agent (open source, by NousResearch)
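For reference, the flags above map onto a llama.cpp server invocation roughly like this (the model filename and everything other than `-ngl 99 -np 1` are placeholders, not from the original post):

```shell
# Sketch of the reported setup: all layers offloaded to the GPU (-ngl 99),
# a single parallel slot (-np 1), and no explicit context or KV-cache flags.
./build/bin/llama-server \
  -m Nemotron-Cascade-2-30B-A3B-IQ4_XS.gguf \
  -ngl 99 \
  -np 1
```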

Results

  • 187 tok/s, flat from 4K to 625K context, zero speed loss
  • Flash attention auto-enables, context auto-allocates to 625K — minimal config needed
  • One-shotted a full GPU marketplace UI from a single prompt, no iteration
  • Tool calling works reliably when paired with a compatible harness (Hermes Agent has per-model parsers for native tool call handling)

vs Qwen 3.5 35B-A3B on the Same Hardware

Same RTX 3090, same active parameters, 24 days apart in release:

|                       | Nemotron Cascade 2 | Qwen 3.5 35B-A3B                              |
| --------------------- | ------------------ | --------------------------------------------- |
| tok/s                 | 187                | 112                                           |
| Max context           | 625K               | 262K                                          |
| Extra flags needed    | None               | KV cache quantization                         |
| One-shot agent coding | Yes                | Needed iteration to recover from blank screen |

~67% faster generation with fewer flags on identical hardware.
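The speedup figure checks out against the table; a one-liner to reproduce the arithmetic:

```shell
# Relative generation speedup: 187 tok/s vs 112 tok/s.
awk 'BEGIN { printf "%.0f%%\n", (187 / 112 - 1) * 100 }'
# prints: 67%
```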

A Note on Harness Compatibility

The original tester noted that this model may underperform with generic harnesses (e.g. tool call parsing issues). Pairing with Hermes Agent resolved these issues — same model, same hardware, significantly better results. Worth keeping in mind when evaluating.

Try It and Share Your Results!

If you've run Cascade 2 on any hardware, please drop your numbers and experience below:

  • GPU
  • Quant & Engine
  • tok/s and max context tested
  • Use case (chat, agent/tool-calling, coding, etc.)
  • Screenshots or demos if you have them!

References

[Screenshot of the original benchmark post, 2026-03-27]


88 t/s with MLX 4-bit on an M4 Pro (48GB), but the output is all junk.

105–109 t/s with IQ4_XS on an R9700, same flags.


122 t/s with IQ4_XS on a 7900 XTX.


Here are my results; let me know what you think. This is my first time putting a lot of effort into quantizing a model. I've been doing llama.cpp NVFP4 work for a very long time and have lots to share.

Model: michaelw9999/Nemotron-Cascade-2-30B-A3B-NVFP4-GGUF
Size: 20.7 GB - Moved some tensors from BF16 to Q8 to cut down on size
Hardware: SM120 RTX 5090 32GB
Platform/Kernel: My PR on llama.cpp #21074 (Native Blackwell MMA version coming soon)
Quantizer: My llama-quantizer using imatrix, still a work in progress, would love feedback if this model works well for anyone or if I should make changes
Eval benchmarks aren't completing correctly for me yet on the default configuration; it's my first time trying to set those up. The answers were correct in the reasoning output, but it would cut off before finishing. I still need to figure out that issue.

prompt eval time =    5496.39 ms / 40782 tokens (    0.13 ms per token,  7419.78 tokens per second)
       eval time =    5471.31 ms /  1106 tokens (    4.95 ms per token,   202.15 tokens per second)
      total time =   10967.70 ms / 41888 tokens

prompt eval time =      72.94 ms /   220 tokens (    0.33 ms per token,  3016.38 tokens per second)
       eval time =     135.60 ms /    30 tokens (    4.52 ms per token,   221.24 tokens per second)
      total time =     208.54 ms /   250 tokens
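As a sanity check, the reported rates follow directly from the timing lines above (tokens divided by eval time in seconds), e.g. for the first run:

```shell
# 1106 generated tokens in 5471.31 ms of eval time -> tokens per second.
awk 'BEGIN { printf "%.2f tok/s\n", 1106 / (5471.31 / 1000) }'
# prints: 202.15 tok/s
```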

With llama-bench:

ggml_cuda_init: found 1 CUDA devices (Total VRAM: 32606 MiB):
  Device 0: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes, VRAM: 32606 MiB
| model                          |       size |     params | backend    | ngl | n_batch | n_ubatch | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | -------: | -: | --------------: | -------------------: |
| nemotron_h_moe 31B.A3.5B NVFP4 |  19.28 GiB |    31.58 B | CUDA       |  99 |     128 |      128 |  0 |           pp256 |      2914.76 ± 10.52 |
| nemotron_h_moe 31B.A3.5B NVFP4 |  19.28 GiB |    31.58 B | CUDA       |  99 |     128 |      128 |  1 |           tg128 |        229.52 ± 0.29 |
| nemotron_h_moe 31B.A3.5B NVFP4 |  19.28 GiB |    31.58 B | CUDA       |  99 |     128 |      128 |  1 |  pp2048 @ d2048 |      2881.25 ± 15.60 |
| nemotron_h_moe 31B.A3.5B NVFP4 |  19.28 GiB |    31.58 B | CUDA       |  99 |     128 |      128 |  1 |   tg128 @ d2048 |        231.26 ± 1.66 |
| nemotron_h_moe 31B.A3.5B NVFP4 |  19.28 GiB |    31.58 B | CUDA       |  99 |     128 |      128 |  1 |   pp512 @ d8192 |      2856.47 ± 14.79 |
| nemotron_h_moe 31B.A3.5B NVFP4 |  19.28 GiB |    31.58 B | CUDA       |  99 |     128 |      256 |  1 |   tg128 @ d8192 |        224.18 ± 1.67 |
| nemotron_h_moe 31B.A3.5B NVFP4 |  19.28 GiB |    31.58 B | CUDA       |  99 |     128 |      128 |  1 | pp1024 @ d16384 |      2818.49 ± 23.84 |
| nemotron_h_moe 31B.A3.5B NVFP4 |  19.28 GiB |    31.58 B | CUDA       |  99 |     128 |      256 |  1 |  tg128 @ d16384 |        221.46 ± 1.24 |
:~/llama.cpp-nvfp4-mmq-mma(nvfp4-mmq-mma)$ ./build/bin/llama-bench -m /mnt/c/Nemotron-Cascade-2-30B-A3B-NVFP4-GGUF.gguf 
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 32606 MiB):
  Device 0: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes, VRAM: 32606 MiB
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| nemotron_h_moe 31B.A3.5B NVFP4 |  19.28 GiB |    31.58 B | CUDA       |  99 |           pp512 |      7846.34 ± 41.91 |
| nemotron_h_moe 31B.A3.5B NVFP4 |  19.28 GiB |    31.58 B | CUDA       |  99 |           tg128 |        226.91 ± 1.13 |

build: f63b6f6ca (8556)
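For anyone wanting to reproduce these tables, the runs appear to correspond to llama-bench invocations along these lines (the model path is a placeholder, and the exact `-fa`/`-ub` mix varies per row in the first table):

```shell
# First table: small batches, flash attention on, at several context depths.
./build/bin/llama-bench -m Nemotron-Cascade-2-30B-A3B-NVFP4-GGUF.gguf \
  -b 128 -ub 128 -fa 1 \
  -p 512 -n 128 -d 2048,8192,16384

# Second table: llama-bench defaults (pp512 / tg128).
./build/bin/llama-bench -m Nemotron-Cascade-2-30B-A3B-NVFP4-GGUF.gguf
```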

Great contributions, thanks for sharing! Compiling the numbers so far into one table:

| User          | GPU            | Quant    | Engine                       | tok/s (tg) | Notes                                                |
| ------------- | -------------- | -------- | ---------------------------- | ---------- | ---------------------------------------------------- |
| @sudoingX     | RTX 3090       | IQ4_XS   | llama.cpp (source)           | 187        | 625K context, flat speed, one-shot agent coding      |
| @djdeniro     | RX 7900 XTX    | IQ4_XS   | llama.cpp                    | 122        |                                                      |
| @djdeniro     | R9700          | IQ4_XS   | llama.cpp                    | 105–109    |                                                      |
| @djdeniro     | M4 Pro 48GB    | MLX-4BIT | MLX                          | 88         | Output quality issues; may be a quant/engine problem |
| @michaelw9999 | RTX 5090       | NVFP4    | llama.cpp (custom PR #21074) | ~221–230   | Native Blackwell MMA coming soon; pp512 at 7846 t/s  |

Not through llama-bench, but here's what I'm seeing from Hermes:

GPU: RTX PRO 6000 Blackwell Max-Q
Quant: BF16
Engine: vLLM
tok/s: 182
pp: 7–14K tok/s

nemotron-cascade-2-30b-a3b_02  | (APIServer pid=1) INFO 03-28 23:09:34 [loggers.py:259] Engine 000: Avg prompt throughput: 5075.5 tokens/s, Avg generation throughput: 116.4 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.6%, Prefix cache hit rate: 0.0%
nemotron-cascade-2-30b-a3b_02  | (APIServer pid=1) INFO 03-28 23:09:44 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 179.7 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.7%, Prefix cache hit rate: 0.0%
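To track the steady-state generation number from vLLM's periodic logger lines, a quick extraction sketch (assuming the log format shown above, with `vllm.log` as a placeholder filename):

```shell
# Pull the "Avg generation throughput" values out of vLLM APIServer logs.
grep -oE 'Avg generation throughput: [0-9.]+' vllm.log \
  | awk '{ print $4 }'
```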

With 4x 7900 XTX and the original unquantized model on vLLM, I got 88 t/s generation throughput and 3–5K avg prompt throughput (no cache).
