Cassiopeia-70B-fp8

Format: FP8_DYNAMIC — weights quantized to FP8 statically; activations scaled dynamically at runtime.
Base model: ddh0/Cassiopeia-70B
How it was made: One-shot, data-free quantization with LLM Compressor (FP8_DYNAMIC recipe) on a DGX Spark (GB10 Grace Blackwell). No calibration data is required — activations are scaled dynamically at runtime.
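To illustrate what dynamic activation scaling means here, below is a minimal pure-Python sketch (an illustration only, not LLM Compressor's implementation): each tensor's scale is derived at runtime from its observed absolute maximum, the values are mapped onto the FP8 E4M3 range (largest finite value 448), and each value is rounded to the format's 3 mantissa bits.

```python
import math

E4M3_MAX = 448.0  # largest finite value representable in FP8 E4M3


def quantize_e4m3(x):
    """Round a float to the nearest FP8 E4M3 value (sketch; clamps to the finite range)."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    mag = min(abs(x), E4M3_MAX)
    exp = max(math.floor(math.log2(mag)), -6)  # -6 is E4M3's smallest normal exponent
    step = 2.0 ** (exp - 3)                    # 3 mantissa bits -> 8 steps per binade
    return sign * round(mag / step) * step


def dynamic_fp8(activations):
    """Per-tensor dynamic scaling: pick the scale from the runtime amax, then quantize."""
    amax = max(abs(a) for a in activations)
    scale = amax / E4M3_MAX                    # map the observed range onto E4M3's range
    quantized = [quantize_e4m3(a / scale) for a in activations]
    return [q * scale for q in quantized]      # dequantized values, for comparison


deq = dynamic_fp8([0.1, -2.5, 7.0])
```

Because the scale tracks each tensor's actual range at runtime, no calibration pass over sample data is needed — which is why the recipe above is data-free.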

Notes: The lm_head and multimodal projection layers are kept in high precision. Blackwell (GB10/B100/B200) has native FP8 hardware support, and Hopper (H100/H200) also supports FP8 natively. Older architectures will fall back to BF16 compute while still benefiting from the reduced model size.

See the original model card (ddh0/Cassiopeia-70B) for more information about the model itself.

Running the model with VLLM in Docker

sudo docker run --runtime nvidia --gpus all -p 8000:8000 --ipc=host vllm/vllm-openai:latest --model Firworks/Cassiopeia-70B-fp8 --dtype auto --max-model-len 32768

Running the model on the DGX Spark with VLLM in Docker

sudo docker run --gpus all --network host --ipc=host nvcr.io/nvidia/vllm:26.02-py3 vllm serve Firworks/Cassiopeia-70B-fp8 --dtype auto --max-model-len 32768
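Both containers expose vLLM's OpenAI-compatible API on port 8000 (assuming the port mapping / host networking shown above). Here is a minimal Python sketch, using only the standard library, of building a chat-completions request against that server — `/v1/chat/completions` is the standard OpenAI-compatible endpoint path:

```python
import json
import urllib.request


def chat_request(prompt, base_url="http://localhost:8000/v1"):
    """Build a chat-completions request for the vLLM OpenAI-compatible server."""
    payload = {
        "model": "Firworks/Cassiopeia-70B-fp8",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )


req = chat_request("Hello!")
# Once the server is up, send it with: urllib.request.urlopen(req)
```

Any OpenAI-compatible client (e.g. the `openai` Python package pointed at `http://localhost:8000/v1`) will work the same way.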

Tested on a DGX Spark (GB10 Grace Blackwell Superchip, 128GB unified memory).

If there are other models you'd like quantized to FP8, let me know.

Safetensors · Model size: 71B params · Tensor types: BF16, F8_E4M3
