# Llama-3.1-8B-Instruct TensorRT-LLM checkpoint (FP8 weight + FP8 KV)
TensorRT-LLM checkpoint for Llama-3.1-8B-Instruct with FP8 (W8A8) weight quantization and an FP8 KV cache. Use `trtllm-build` to turn it into an engine for inference.
## Model details
| Item | Value |
|---|---|
| Base model | Llama-3.1-8B-Instruct |
| Framework | TensorRT-LLM (checkpoint format) |
| Weight quantization | FP8 (W8A8) |
| KV cache | FP8 |
| Producer | TensorRT-Model-Optimizer `llm_ptq` + TensorRT-LLM `convert_checkpoint` (`--use_fp8`, `--fp8_kv_cache`) |
| Architecture | LlamaForCausalLM (decoder-only) |
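For intuition, FP8 (E4M3) weight quantization stores each tensor together with a per-tensor scale derived from its absolute maximum. The toy sketch below shows only the scaling step; it deliberately omits FP8's coarse 3-bit mantissa rounding, which is where the real precision loss happens, and all names are illustrative:

```python
E4M3_MAX = 448.0  # largest finite value representable in FP8 E4M3

def fp8_quantize(x: float, scale: float) -> float:
    """Scale into the FP8 range and clamp (mantissa rounding omitted)."""
    return max(-E4M3_MAX, min(E4M3_MAX, x / scale))

def fp8_dequantize(q: float, scale: float) -> float:
    return q * scale

weights = [0.1, -2.0, 3.5]                       # toy weight tensor
scale = max(abs(w) for w in weights) / E4M3_MAX  # per-tensor scale from amax
roundtrip = [fp8_dequantize(fp8_quantize(w, scale), scale) for w in weights]
```

In-range values round-trip exactly here because rounding is omitted; values beyond the scaled range are clamped to ±448.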
## Build (how to produce this checkpoint)
FP8 requires a two-step pipeline: (1) run Model Optimizer's `llm_ptq` to quantize the Hugging Face model to FP8; (2) run TensorRT-LLM's `convert_checkpoint` on the PTQ output to produce this checkpoint.
### 1. Environment and dependencies
```shell
sudo apt install git-lfs
git lfs install
pip install tensorrt_llm --extra-index-url https://pypi.nvidia.com
# Install TensorRT-Model-Optimizer (required for FP8 quantization):
# https://github.com/NVIDIA/TensorRT-Model-Optimizer
```
### 2. Quantize base model to FP8 (llm_ptq)
Clone the base model and run Model Optimizer's `llm_ptq` to produce an FP8-quantized, HF-format directory. Then run TensorRT-LLM's `convert_checkpoint`:
```shell
# Example: after llm_ptq has produced the PTQ output directory,
# run convert_checkpoint with that directory as --model_dir:
python TensorRT-LLM/examples/llama/convert_checkpoint.py \
    --model_dir ./path/to/ptq_output \
    --output_dir ./llama-3.1-8b-instruct-trtllm-ckpt-wq_fp8-kv_fp8 \
    --dtype float16 \
    --use_fp8 \
    --fp8_kv_cache
```
Without `--use_fp8`, the engine can produce NaN logits; both flags are required for correct FP8 weights plus FP8 KV cache.
### 3. Output
After conversion, `--output_dir` contains `config.json` and `rank0.safetensors`; those two files are the checkpoint in this repo.
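As a sanity check, the checkpoint's `config.json` should record FP8 for both weights and KV cache. A minimal sketch (field names as in recent TensorRT-LLM checkpoint formats; verify against your version, and the fragment below is illustrative, not the full file):

```python
import json

# Illustrative fragment of the checkpoint's config.json; the real file
# also carries architecture fields (hidden_size, num_hidden_layers, ...).
sample_config = json.loads("""
{
  "architecture": "LlamaForCausalLM",
  "dtype": "float16",
  "quantization": {
    "quant_algo": "FP8",
    "kv_cache_quant_algo": "FP8"
  }
}
""")

def is_fp8_checkpoint(config: dict) -> bool:
    """True only if both weights and KV cache are recorded as FP8."""
    q = config.get("quantization", {})
    return q.get("quant_algo") == "FP8" and q.get("kv_cache_quant_algo") == "FP8"
```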
## Upload (how to upload to Hugging Face)
```shell
cd ./llama-3.1-8b-instruct-trtllm-ckpt-wq_fp8-kv_fp8
huggingface-cli repo create rungalileo/llama-3.1-8b-instruct-trtllm-ckpt-wq_fp8-kv_fp8 --repo-type model
huggingface-cli upload rungalileo/llama-3.1-8b-instruct-trtllm-ckpt-wq_fp8-kv_fp8 . --repo-type model
```
## How to use
### 1. Build engine
Requires the `tensorrt_llm` package (installed above):
```shell
git clone https://huggingface.co/rungalileo/llama-3.1-8b-instruct-trtllm-ckpt-wq_fp8-kv_fp8
cd llama-3.1-8b-instruct-trtllm-ckpt-wq_fp8-kv_fp8
trtllm-build --checkpoint_dir . --output_dir ./engine \
    --max_batch_size 1 --max_input_len 512 --max_seq_len 1024
```
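One practical payoff of the FP8 KV cache: at the build settings above (batch 1, `max_seq_len` 1024), it halves KV-cache memory versus an FP16 cache. A back-of-the-envelope calculation, assuming the standard Llama-3.1-8B dimensions (32 layers, 8 KV heads, head dim 128):

```python
# KV cache bytes = 2 (K and V) * layers * batch * seq_len * kv_heads * head_dim * bytes/elem
num_layers, kv_heads, head_dim = 32, 8, 128  # Llama-3.1-8B dims (assumed)
batch, seq_len = 1, 1024                     # matches the trtllm-build flags above

def kv_cache_bytes(bytes_per_elem: int) -> int:
    return 2 * num_layers * batch * seq_len * kv_heads * head_dim * bytes_per_elem

fp16_bytes = kv_cache_bytes(2)  # FP16 baseline: 2 bytes per element
fp8_bytes = kv_cache_bytes(1)   # FP8 KV cache: 1 byte per element
print(f"FP16: {fp16_bytes // 2**20} MiB, FP8: {fp8_bytes // 2**20} MiB")
# FP16: 128 MiB, FP8: 64 MiB
```

The savings grow linearly with batch size and sequence length, so the headroom matters more at larger build limits.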
### 2. Run inference
Use a tokenizer from the base model (e.g. `meta-llama/Llama-3.1-8B-Instruct`):
```shell
trtllm-serve ./engine --tokenizer meta-llama/Llama-3.1-8B-Instruct --port 8000
# OpenAI-compatible API: http://localhost:8000/v1/completions
```
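Once the server is up, any OpenAI-style completions client works. A minimal sketch using only the Python standard library; the `model` value and prompt are placeholders, and the actual network call is left commented out:

```python
import json
import urllib.request

def build_completion_request(prompt: str,
                             url: str = "http://localhost:8000/v1/completions"):
    """Build (but do not send) an OpenAI-style /v1/completions request."""
    payload = {"model": "engine", "prompt": prompt, "max_tokens": 64}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

req = build_completion_request("Explain FP8 quantization in one sentence.")
# To actually query a running server:
#   with urllib.request.urlopen(req) as resp:
#       print(json.load(resp)["choices"][0]["text"])
```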
## Files in this repo
- `config.json`: TensorRT-LLM model config
- `rank0.safetensors`: rank 0 weights (single-GPU)