Local single-GPU measurements from the validated local engine on `RTX 5070 Laptop`.

These numbers are local measurements from one machine and should be treated as reference values, not portability guarantees.
## Quick Parity Check
A small public sanity check was run against the upstream Hugging Face baseline, using `20` validation examples from `ARC-Challenge` and `20` from `OpenBookQA`.

| Benchmark | HF baseline | TRT FP8 | Agreement |
|---|---:|---:|---:|
| `ARC-Challenge` | `0.90` | `0.90` | `0.90` |
| `OpenBookQA` | `0.80` | `0.80` | `0.90` |
| `Overall` | `0.85` | `0.85` | `0.90` |

This is only a quick local parity check, not a full benchmark suite. It is intended to show the practical tradeoff of this conversion on a small public subset.
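The parity numbers above reduce to simple per-example comparisons. A minimal sketch of how accuracy and agreement can be computed, using made-up 10-example prediction lists (not the actual evaluation data) that illustrate how two engines can have equal accuracy yet imperfect agreement:

```python
def accuracy(preds, golds):
    """Fraction of predictions that match the gold labels."""
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)

def agreement(preds_a, preds_b):
    """Fraction of examples where the two engines give the same answer."""
    return sum(a == b for a, b in zip(preds_a, preds_b)) / len(preds_a)

# Hypothetical 10-question benchmark: gold labels and two engines' answers.
golds = list("ABCDABCDAB")
hf    = list("ABCDABCDAA")  # wrong on the last question
trt   = list("ABCDABCDBB")  # wrong on the second-to-last question
print(accuracy(hf, golds), accuracy(trt, golds), agreement(hf, trt))
# → 0.9 0.9 0.8
```

Both engines score `0.9` here, but they disagree on two questions, so agreement is `0.8`; this is why the table reports agreement separately from the two accuracies.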
## FP8 vs NVFP4
The table below compares two locally validated TensorRT-LLM variants built for the same GPU family and the same local engine limits (`max_batch_size=1`, `max_seq_len=512`, `max_num_tokens=128`).

| Variant | Checkpoint size | Engine size | `short_chat_42_64` | `balanced_64_64` | `long_generation_32_96` | Quick-check overall | Quick-check change vs TRT FP8 | Practical reading |
|---|---:|---:|---:|---:|---:|---:|---|---|
| `FP8` | `5.3 GB` | `5.4 GB` | `54.88 tok/s` | `56.68 tok/s` | `55.68 tok/s` | `0.85` | baseline | Best balance in these local tests |
| `NVFP4` | `4.0 GB` | `3.0 GB` | `78.33 tok/s` | `78.24 tok/s` | `82.27 tok/s` | `0.775` | `-7.5` pts on this quick check | Faster and smaller, but with a visible quality drop |

This comparison is intentionally local and narrow. It should not be treated as a universal benchmark across all prompts, datasets, GPUs, or TensorRT-LLM versions.
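For a quick read of the tradeoff, the table's own numbers can be reduced to per-scenario speed ratios and an accuracy delta (values copied from the table above):

```python
# Throughput (tok/s) and quick-check accuracy, taken from the table above.
fp8   = {"short_chat": 54.88, "balanced": 56.68, "long_gen": 55.68, "acc": 0.85}
nvfp4 = {"short_chat": 78.33, "balanced": 78.24, "long_gen": 82.27, "acc": 0.775}

# Per-scenario throughput speedup of NVFP4 over FP8.
speedups = {k: round(nvfp4[k] / fp8[k], 2) for k in ("short_chat", "balanced", "long_gen")}
# Quick-check accuracy change, in percentage points.
delta_pts = round((nvfp4["acc"] - fp8["acc"]) * 100, 1)

print(speedups, delta_pts)
# → {'short_chat': 1.43, 'balanced': 1.38, 'long_gen': 1.48} -7.5
```

So on these local runs NVFP4 is roughly `1.4x` faster across scenarios at the cost of `7.5` points on the quick check.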
On that same `40`-question subset, the HF baseline and the local TensorRT-LLM FP8 engine both scored `0.85`.
## Notes
- This is not an official Microsoft or NVIDIA release.