Commit 3a9b03f by Shoolife · verified · 1 Parent(s): fac210c

Upload README.md with huggingface_hub

Files changed (1): README.md +27 -0
@@ -193,6 +193,33 @@ Local single-GPU measurements from the validated local engine on `RTX 5070 Lapto

  These numbers are local measurements from one machine and should be treated as reference values, not portability guarantees.

+ ## Quick Parity Check
+
+ A small public sanity-check was run against the upstream Hugging Face baseline on `20` validation examples from `ARC-Challenge` and `20` validation examples from `OpenBookQA`.
+
+ | Benchmark | HF baseline | TRT FP8 | Agreement |
+ |---|---:|---:|---:|
+ | `ARC-Challenge` | `0.90` | `0.90` | `0.90` |
+ | `OpenBookQA` | `0.80` | `0.80` | `0.90` |
+ | `Overall` | `0.85` | `0.85` | `0.90` |
+
+ This is only a quick local parity check, not a full benchmark suite. It is intended to show the practical tradeoff of this conversion on a small public subset.
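Editorial sketch, not part of the commit: the snippet below shows how the `Overall` row can be recomputed from the per-benchmark rows, and why identical accuracy does not imply full agreement. It assumes that "Agreement" means the fraction of questions on which both systems picked the same answer, which the README does not state. The function names and the toy answer lists are hypothetical and are not the README's data.

```python
def accuracy(preds, gold):
    """Fraction of predictions that match the gold labels."""
    if len(preds) != len(gold):
        raise ValueError("preds and gold must have the same length")
    return sum(p == g for p, g in zip(preds, gold)) / len(gold)


def agreement(preds_a, preds_b):
    """Fraction of examples where two systems give the same answer,
    whether or not that answer is correct (assumed definition)."""
    if len(preds_a) != len(preds_b):
        raise ValueError("both prediction lists must have the same length")
    return sum(a == b for a, b in zip(preds_a, preds_b)) / len(preds_a)


def pooled_accuracy(per_benchmark):
    """Pool (accuracy, n_examples) pairs into one accuracy over all examples."""
    total = sum(n for _, n in per_benchmark)
    correct = sum(round(acc * n) for acc, n in per_benchmark)
    return correct / total


# Per-benchmark rows from the Quick Parity Check table above (20 examples each).
overall = pooled_accuracy([(0.90, 20), (0.80, 20)])

# Hypothetical toy answers: both systems miss exactly one question,
# but not the same one, so accuracy matches while agreement is lower.
toy_gold = list("ABCDABCDAB")
toy_hf = list("BBCDABCDAB")   # wrong on question 0
toy_trt = list("AACDABCDAB")  # wrong on question 1

if __name__ == "__main__":
    print(f"pooled overall accuracy: {overall:.2f}")
    print(f"toy accuracy (HF, TRT): {accuracy(toy_hf, toy_gold)}, {accuracy(toy_trt, toy_gold)}")
    print(f"toy agreement: {agreement(toy_hf, toy_trt)}")
```

Under that assumed definition, an agreement of `0.90` alongside equal accuracies would mean the two systems answered some questions differently even though their scores match.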
+
+ ## FP8 vs NVFP4
+
+ The table below compares two locally validated TensorRT-LLM variants built for the same GPU family and the same local engine limits (`max_batch_size=1`, `max_seq_len=512`, `max_num_tokens=128`).
+
+ | Variant | Checkpoint | Engine | `short_chat_42_64` | `balanced_64_64` | `long_generation_32_96` | Quick-check overall | Quick-check change vs TRT FP8 | Practical reading |
+ |---|---:|---:|---:|---:|---:|---:|---|---|
+ | `FP8` | `5.3 GB` | `5.4 GB` | `54.88 tok/s` | `56.68 tok/s` | `55.68 tok/s` | `0.85` | `baseline` | Best balance in these local tests |
+ | `NVFP4` | `4.0 GB` | `3.0 GB` | `78.33 tok/s` | `78.24 tok/s` | `82.27 tok/s` | `0.775` | `-7.5 pts on this quick-check` | Faster and smaller, but with a visible quality drop |
+
+ This comparison is intentionally local and narrow. It should not be treated as a universal benchmark across all prompts, datasets, GPUs, or TensorRT-LLM versions.
+
+ HF baseline on that same `40`-question subset: `0.85`.
+
+ On that same `40`-question subset, the local TensorRT-LLM FP8 engine also scored `0.85`.
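Editorial sketch, not part of the commit: the arithmetic behind the "faster, but with a visible quality drop" reading can be reproduced from the table values alone. The numbers are copied from the FP8 vs NVFP4 table above. The helper names are hypothetical, and the conversion from accuracy to a question count assumes every question carries equal weight.

```python
# Throughput in tok/s, copied from the FP8 vs NVFP4 table above.
THROUGHPUT = {
    "short_chat_42_64": {"FP8": 54.88, "NVFP4": 78.33},
    "balanced_64_64": {"FP8": 56.68, "NVFP4": 78.24},
    "long_generation_32_96": {"FP8": 55.68, "NVFP4": 82.27},
}

# Quick-check overall accuracy on the 40-question subset, from the same table.
QUICK_CHECK = {"FP8": 0.85, "NVFP4": 0.775}
N_QUESTIONS = 40


def speedups(table, baseline="FP8", variant="NVFP4"):
    """Throughput ratio variant / baseline for each workload profile."""
    return {name: row[variant] / row[baseline] for name, row in table.items()}


def delta_points(scores, baseline="FP8", variant="NVFP4"):
    """Accuracy change in percentage points, rounded to one decimal."""
    return round((scores[variant] - scores[baseline]) * 100, 1)


def questions_lost(scores, n, baseline="FP8", variant="NVFP4"):
    """How many fewer questions the variant answered correctly out of n
    (assumes equally weighted questions)."""
    return round(scores[baseline] * n) - round(scores[variant] * n)


if __name__ == "__main__":
    for name, ratio in speedups(THROUGHPUT).items():
        print(f"{name}: NVFP4 is {ratio:.2f}x FP8")
    print(f"quick-check change: {delta_points(QUICK_CHECK)} pts")
    print(f"questions lost out of {N_QUESTIONS}: {questions_lost(QUICK_CHECK, N_QUESTIONS)}")
```

With only `40` questions, the `-7.5` point change corresponds to a handful of answers, which is one reason to read the quality column as indicative rather than conclusive.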
+
  ## Notes

  - This is not an official Microsoft or NVIDIA release.