Local single-GPU measurements from the validated local engine on `RTX 5070 Laptop`.

These numbers are local measurements from one machine and should be treated as reference values, not portability guarantees.
## Quick Parity Check
A small public sanity check was run against the upstream Hugging Face baseline, using `20` validation examples from `ARC-Challenge` and `20` from `OpenBookQA`.

| Benchmark | HF baseline | TRT FP8 | Agreement |
|---|---:|---:|---:|
| `ARC-Challenge` | `0.90` | `0.90` | `0.90` |
| `OpenBookQA` | `0.80` | `0.80` | `0.90` |
| `Overall` | `0.85` | `0.85` | `0.90` |

This is only a quick local parity check, not a full benchmark suite. It is intended to show the practical tradeoff of this conversion on a small public subset.
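The parity numbers above reduce to simple per-example comparisons. A minimal sketch of how accuracy and agreement can be computed, using made-up 10-example prediction lists (not the actual evaluation data) that illustrate how two engines can have equal accuracy yet imperfect agreement:

```python
def accuracy(preds, golds):
    """Fraction of predictions that match the gold labels."""
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)

def agreement(preds_a, preds_b):
    """Fraction of examples where the two engines give the same answer."""
    return sum(a == b for a, b in zip(preds_a, preds_b)) / len(preds_a)

# Hypothetical 10-question benchmark: gold labels and two engines' answers.
golds = list("ABCDABCDAB")
hf    = list("ABCDABCDAA")  # wrong on the last question
trt   = list("ABCDABCDBB")  # wrong on the second-to-last question
print(accuracy(hf, golds), accuracy(trt, golds), agreement(hf, trt))
# → 0.9 0.9 0.8
```

Both engines score `0.9` here, but they disagree on two questions, so agreement is `0.8`; this is why the table reports agreement separately from the two accuracies.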
## FP8 vs NVFP4
The table below compares two locally validated TensorRT-LLM variants built for the same GPU family and the same local engine limits (`max_batch_size=1`, `max_seq_len=512`, `max_num_tokens=128`).

| Variant | Checkpoint size | Engine size | `short_chat_42_64` | `balanced_64_64` | `long_generation_32_96` | Quick-check overall | Quick-check change vs TRT FP8 | Practical reading |
|---|---:|---:|---:|---:|---:|---:|---|---|
| `FP8` | `5.3 GB` | `5.4 GB` | `54.88 tok/s` | `56.68 tok/s` | `55.68 tok/s` | `0.85` | baseline | Best balance in these local tests |
| `NVFP4` | `4.0 GB` | `3.0 GB` | `78.33 tok/s` | `78.24 tok/s` | `82.27 tok/s` | `0.775` | `-7.5` pts on this quick check | Faster and smaller, but with a visible quality drop |

This comparison is intentionally local and narrow. It should not be treated as a universal benchmark across all prompts, datasets, GPUs, or TensorRT-LLM versions.
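For a quick read of the tradeoff, the table's own numbers can be reduced to per-scenario speed ratios and an accuracy delta (values copied from the table above):

```python
# Throughput (tok/s) and quick-check accuracy, taken from the table above.
fp8   = {"short_chat": 54.88, "balanced": 56.68, "long_gen": 55.68, "acc": 0.85}
nvfp4 = {"short_chat": 78.33, "balanced": 78.24, "long_gen": 82.27, "acc": 0.775}

# Per-scenario throughput speedup of NVFP4 over FP8.
speedups = {k: round(nvfp4[k] / fp8[k], 2) for k in ("short_chat", "balanced", "long_gen")}
# Quick-check accuracy change, in percentage points.
delta_pts = round((nvfp4["acc"] - fp8["acc"]) * 100, 1)

print(speedups, delta_pts)
# → {'short_chat': 1.43, 'balanced': 1.38, 'long_gen': 1.48} -7.5
```

So on these local runs NVFP4 is roughly `1.4x` faster across scenarios at the cost of `7.5` points on the quick check.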
On that same `40`-question subset, the HF baseline and the local TensorRT-LLM FP8 engine both scored `0.85`.
## Notes
- This is not an official Microsoft or NVIDIA release.