# Llama-xLAM-2-70b-fc-r-NVFP4
NVFP4-quantized version of Salesforce/Llama-xLAM-2-70b-fc-r, produced by Enfuse.
## Model Overview
| Attribute | Value |
|---|---|
| Base Model | Salesforce/Llama-xLAM-2-70b-fc-r |
| Parameters | 70B (dense) |
| Architecture | Dense Transformer (Llama 3.1 based) |
| Specialization | Agentic tool/function calling (#1 on BFCL, #1 on τ-bench) |
| Quantization | NVFP4 (W4A4 with FP4 weights and dynamic FP4 activations) |
| Format | compressed-tensors (safetensors) |
| Precision | FP4 weights (group_size=16), FP8 scales, lm_head unquantized |
| Approx. Size | ~40 GB (down from ~132 GB in BF16) |
| Context Length | 128,000 tokens |
| License | Llama 3.1 Community License |
## Why This Model
Salesforce's xLAM-2-70b-fc-r is the #1 open-weight model on the Berkeley Function Calling Leaderboard (BFCL) and 蟿-bench, outperforming GPT-4o and Claude 3.5 on agentic tool-calling tasks. At 132 GB in BF16, it requires multiple high-end GPUs to serve. This NVFP4 quantization reduces it to ~40 GB, making it deployable on a single B200 GPU while preserving its function-calling capabilities.
## How to Use

### vLLM (recommended)
```python
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model_id = "enfuse/Llama-xLAM-2-70b-fc-r-NVFP4"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# A single GPU is sufficient for the ~40 GB NVFP4 checkpoint
llm = LLM(model=model_id, tensor_parallel_size=1)
sampling_params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=512)

messages = [
    {"role": "user", "content": "You have access to a function called get_weather(city: str, unit: str = 'celsius') that returns current weather. Call it for New York in fahrenheit."},
]

# Render the chat template to a prompt string, then generate
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
outputs = llm.generate([prompt], sampling_params)
print(outputs[0].outputs[0].text)
```
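Once the model responds, the tool call has to be extracted from the generated text. The exact output convention depends on the model's chat template; as an illustrative sketch, assuming the model emits calls as a JSON array of `{"name": ..., "arguments": {...}}` objects (a common xLAM-style convention, not verified here), a minimal parser might look like:

```python
import json

def parse_tool_calls(text):
    """Parse a JSON tool-call list from model output.

    Assumes calls are emitted as a JSON array of
    {"name": ..., "arguments": {...}} objects -- a hypothetical
    format; check the model's chat template for the real one.
    """
    try:
        calls = json.loads(text.strip())
    except json.JSONDecodeError:
        return []
    if isinstance(calls, dict):
        calls = [calls]
    return [c for c in calls if isinstance(c, dict) and "name" in c]

raw = '[{"name": "get_weather", "arguments": {"city": "New York", "unit": "fahrenheit"}}]'
for call in parse_tool_calls(raw):
    print(call["name"], call["arguments"])
```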
## Hardware Requirements
- Full NVFP4 (W4A4): requires an NVIDIA Blackwell GPU (B200, GB200, RTX 5090)
- Weight-only FP4: older GPUs (H100, A100) can load the model, but only weight quantization is applied
- Recommended: 1x B200 (the ~40 GB model fits within a single GPU's 183 GB of memory, leaving ample room for KV cache)
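The ~40 GB figure is easy to sanity-check with back-of-envelope arithmetic: 70B parameters at 4 bits each, plus one 1-byte FP8 scale per 16-weight group (the unquantized `lm_head` and metadata account for the remainder):

```python
# Rough NVFP4 footprint for a 70B dense model (illustrative arithmetic)
params = 70e9
weight_bytes = params * 4 / 8   # 4-bit FP4 weights
scale_bytes = params / 16       # one 1-byte FP8 scale per 16-weight group
total_gb = (weight_bytes + scale_bytes) / 1e9
print(f"~{total_gb:.1f} GB before lm_head and metadata")
```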
## Quantization Details
Quantized using LLM Compressor (v0.10.0):
- Method: Post-training quantization (PTQ) with calibration
- Calibration data: 512 samples from HuggingFaceH4/ultrachat_200k
- Sequence length: 2048 tokens
- Scheme: NVFP4
- Excluded layers: `lm_head`
## Infrastructure
Quantized on an NVIDIA DGX B200 (8x B200, 2 TiB RAM, CUDA 13.0).
## Evaluation

### BFCL v3/v4 Benchmarks (Agentic Function Calling)

Evaluated using the BFCL harness with a vLLM backend on NVIDIA B200 GPUs.
| Category | NVFP4 Accuracy |
|---|---|
| **Single-Turn** | |
| simple_python | 95.25% |
| multiple | 93.00% |
| parallel | 91.00% |
| parallel_multiple | 87.00% |
| simple_java | 64.00% |
| simple_javascript | 76.00% |
| irrelevance | 83.33% |
| **Multi-Turn (BFCL v3)** | |
| multi_turn_base | 84.00% |
| multi_turn_miss_func | 73.00% |
| multi_turn_miss_param | 72.00% |
### OpenLLM v1 Benchmarks

Evaluated using lm-evaluation-harness (v0.4.11) with the vLLM backend (`--apply_chat_template --fewshot_as_multiturn`, `tensor_parallel_size=2`) on NVIDIA B200 GPUs.
| Benchmark | Metric | n-shot | NVFP4 | BF16 Reference | Recovery |
|---|---|---|---|---|---|
| ARC-Challenge | acc_norm | 25 | 68.26 | 68.9 | 99.1% |
| GSM8K | exact_match | 5 | 84.23 | 84.0 | 100.3% |
| HellaSwag | acc_norm | 10 | 83.32 | 86.0 | 96.9% |
| MMLU | acc | 5 | 77.69 | 79.3 | 97.9% |
| TruthfulQA MC2 | acc | 0 | 54.40 | 60.0 | 90.7% |
| Winogrande | acc | 5 | 81.37 | 83.3 | 97.7% |
BF16 reference scores from Open LLM Leaderboard v1 (Llama-3.1-70B-Instruct base). Average recovery: ~97%.
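The Recovery column is simply the quantized score expressed as a percentage of the BF16 reference:

```python
def recovery(quant_score, ref_score):
    """Percent of the BF16 reference score retained after quantization."""
    return round(100 * quant_score / ref_score, 1)

print(recovery(84.23, 84.0))  # GSM8K row -> 100.3
```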
### xLAM-2-70b BF16 Reference Scores

Reference scores from the xLAM model card (note: different evaluation methodology):
| Benchmark | BF16 Score |
|---|---|
| BFCL Overall | #1 on leaderboard |
| τ-bench Overall | 56.2% (vs GPT-4o 52.9%) |
## About Enfuse
Enfuse builds sovereign AI infrastructure for regulated enterprises. The Enfuse platform provides on-prem LLM orchestration and an App Factory for shipping governed, compliant AI applications on your own infrastructure.
This quantization is part of our ongoing work to make large language models more accessible and efficient for on-premise deployment, where memory efficiency directly impacts what models organizations can run within their own data centers.
## Acknowledgments
- Salesforce AI Research for the xLAM-2-70b-fc-r model
- vLLM Project for LLM Compressor
- NVIDIA for the NVFP4 format and Blackwell hardware