Llama-3.1-8B-Instruct TensorRT-LLM checkpoint (FP8 weight + FP8 KV)

TensorRT-LLM checkpoint for Llama-3.1-8B-Instruct with FP8 quantization of weights and activations (W8A8) and an FP8 KV cache. Use with trtllm-build to produce an engine for inference.

Model details

Item                  Value
Base model            Llama-3.1-8B-Instruct
Framework             TensorRT-LLM (checkpoint format)
Weight quantization   FP8 (W8A8)
KV cache              FP8
Producer              TensorRT-Model-Optimizer llm_ptq + TensorRT-LLM convert_checkpoint (--use_fp8, --fp8_kv_cache)
Architecture          LlamaForCausalLM (decoder-only)

Build (how to produce this checkpoint)

FP8 requires a two-step pipeline: (1) run Model Optimizer llm_ptq to quantize the Hugging Face model to FP8; (2) run TensorRT-LLM convert_checkpoint with the PTQ output to produce this checkpoint.

1. Environment and dependencies

sudo apt install git-lfs
git lfs install

pip install tensorrt_llm --extra-index-url https://pypi.nvidia.com
# Install TensorRT-Model-Optimizer (required for FP8 quantization)
# See https://github.com/NVIDIA/TensorRT-Model-Optimizer

2. Quantize base model to FP8 (llm_ptq)

Clone the base model and run Model Optimizer's llm_ptq to produce an FP8-quantized HF-format directory. Then run TensorRT-LLM convert_checkpoint:

# Example: after llm_ptq has produced PTQ output directory,
# run convert_checkpoint with that directory as --model_dir:
python TensorRT-LLM/examples/llama/convert_checkpoint.py \
  --model_dir ./path/to/ptq_output \
  --output_dir ./llama-3.1-8b-instruct-trtllm-ckpt-wq_fp8-kv_fp8 \
  --dtype float16 \
  --use_fp8 \
  --fp8_kv_cache

(Without --use_fp8, the engine can produce NaN logits; both flags are required for correct FP8 weight + FP8 KV.)
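To make the scale math concrete, here is a toy sketch of what FP8 (E4M3) per-tensor scaling means. The real calibration happens inside Model Optimizer during llm_ptq; nothing below is Model Optimizer API, and all names are illustrative. The snippet only shows the amax-to-scale mapping that PTQ calibrates per tensor:

```python
# Toy illustration of the per-tensor scale used by FP8 (E4M3)
# quantization of weights and KV-cache entries. NOT the Model
# Optimizer API -- just the scale arithmetic it calibrates.
E4M3_MAX = 448.0  # largest finite magnitude representable in FP8 E4M3

def fp8_scale(values):
    """Map the tensor's absolute maximum onto the FP8 dynamic range."""
    amax = max(abs(v) for v in values)
    return amax / E4M3_MAX

# A tensor whose amax is 448 gets scale 1.0: values cast to FP8
# directly. Larger amax -> larger scale -> coarser resolution.
print(fp8_scale([448.0, -100.0, 3.5]))  # -> 1.0
```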

3. Output

After conversion, --output_dir contains config.json and rank0.safetensors; that is the checkpoint in this repo.
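As a quick sanity check, the converted config.json should record FP8 for both weights and KV cache. A minimal sketch, assuming the common TensorRT-LLM checkpoint layout (a top-level "quantization" object with "quant_algo" and "kv_cache_quant_algo"; verify against your TensorRT-LLM version):

```python
import json
import os

def is_fp8_checkpoint(ckpt_dir: str) -> bool:
    """Return True if config.json records FP8 weights and FP8 KV cache.

    Field names follow the usual TensorRT-LLM checkpoint config layout;
    adjust if your TensorRT-LLM version differs.
    """
    with open(os.path.join(ckpt_dir, "config.json")) as f:
        cfg = json.load(f)
    quant = cfg.get("quantization", {})
    return (quant.get("quant_algo") == "FP8"
            and quant.get("kv_cache_quant_algo") == "FP8")
```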

Upload (how to upload to Hugging Face)

cd ./llama-3.1-8b-instruct-trtllm-ckpt-wq_fp8-kv_fp8

huggingface-cli repo create rungalileo/llama-3.1-8b-instruct-trtllm-ckpt-wq_fp8-kv_fp8 --repo-type model
huggingface-cli upload rungalileo/llama-3.1-8b-instruct-trtllm-ckpt-wq_fp8-kv_fp8 . --repo-type model

How to use

1. Build engine

Requires the tensorrt_llm Python package, which provides the trtllm-build command:

git clone https://huggingface.co/rungalileo/llama-3.1-8b-instruct-trtllm-ckpt-wq_fp8-kv_fp8
cd llama-3.1-8b-instruct-trtllm-ckpt-wq_fp8-kv_fp8

trtllm-build --checkpoint_dir . --output_dir ./engine \
  --max_batch_size 1 --max_input_len 512 --max_seq_len 1024
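For sizing intuition, the FP8 KV cache for these build limits is small. A back-of-envelope sketch using Llama-3.1-8B's published geometry (32 layers, 8 KV heads via GQA, head dim 128); the function name is illustrative, not a TensorRT-LLM API:

```python
def kv_cache_bytes(batch: int, seq_len: int, layers: int = 32,
                   kv_heads: int = 8, head_dim: int = 128,
                   bytes_per_elem: int = 1) -> int:
    """KV-cache size: K and V each hold kv_heads * head_dim elements
    per token per layer; FP8 stores one byte per element."""
    return 2 * batch * seq_len * layers * kv_heads * head_dim * bytes_per_elem

# --max_batch_size 1 and --max_seq_len 1024, as in the build above
print(kv_cache_bytes(1, 1024) // 2**20, "MiB")  # -> 64 MiB
```

With a 1-byte FP8 element, each token costs 64 KiB across all layers, so the FP16-vs-FP8 KV choice halves this budget.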

2. Run inference

Use a tokenizer from the base model (e.g. meta-llama/Llama-3.1-8B-Instruct):

trtllm-serve ./engine --tokenizer meta-llama/Llama-3.1-8B-Instruct --port 8000
# OpenAI-compatible API: http://localhost:8000/v1/completions
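A minimal client sketch for the endpoint above, using only the Python standard library. The model name the server expects ("engine" here) is an assumption; check trtllm-serve's startup output:

```python
import json
from urllib import request

def completion_request(prompt: str,
                       url: str = "http://localhost:8000/v1/completions",
                       model: str = "engine",  # assumed; check server log
                       max_tokens: int = 64) -> request.Request:
    """Build an OpenAI-style /v1/completions request for the local server."""
    body = json.dumps({"model": model, "prompt": prompt,
                       "max_tokens": max_tokens}).encode()
    return request.Request(url, data=body,
                           headers={"Content-Type": "application/json"})

# With the server running:
# resp = request.urlopen(completion_request("Hello, world"))
# print(json.load(resp)["choices"][0]["text"])
```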

Files in this repo

  • config.json – TensorRT-LLM model config
  • rank0.safetensors – Rank 0 weights (single-GPU)
