# Llama-xLAM-2-70b-fc-r-NVFP4
NVFP4-quantized version of Salesforce/Llama-xLAM-2-70b-fc-r, produced by Enfuse.
## Model Overview
| Attribute | Value |
|---|---|
| Base Model | Salesforce/Llama-xLAM-2-70b-fc-r |
| Parameters | 70B (dense) |
| Architecture | Dense Transformer (Llama 3.1 based) |
| Specialization | Agentic tool/function calling (#1 on BFCL, #1 on τ-bench) |
| Quantization | NVFP4 (W4A4 with FP4 weights and dynamic FP4 activations) |
| Format | compressed-tensors (safetensors) |
| Precision | FP4 weights (group_size=16), FP8 scales, lm_head unquantized |
| Approx. Size | ~40 GB (down from ~132 GB in BF16) |
| Context Length | 128,000 tokens |
| License | Llama 3.1 Community License |
## Why This Model
Salesforce's xLAM-2-70b-fc-r is the #1 open-weight model on the Berkeley Function Calling Leaderboard (BFCL) and 蟿-bench, outperforming GPT-4o and Claude 3.5 on agentic tool-calling tasks. At 132 GB in BF16, it requires multiple high-end GPUs to serve. This NVFP4 quantization reduces it to ~40 GB, making it deployable on a single B200 GPU while preserving its function-calling capabilities.
## How to Use

### vLLM (recommended)
```python
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model_id = "enfuse/Llama-xLAM-2-70b-fc-r-NVFP4"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# A single GPU is sufficient for the ~40 GB NVFP4 checkpoint
llm = LLM(model=model_id, tensor_parallel_size=1)
sampling_params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=512)

messages = [
    {"role": "user", "content": "You have access to a function called get_weather(city: str, unit: str = 'celsius') that returns current weather. Call it for New York in fahrenheit."},
]

# Render the chat template to a prompt string, then generate
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
outputs = llm.generate([prompt], sampling_params)
print(outputs[0].outputs[0].text)
```
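Once the model responds, the tool call has to be extracted from the generated text. The exact output convention depends on the model's chat template; as an illustrative sketch, assuming the model emits calls as a JSON array of `{"name": ..., "arguments": {...}}` objects (a common xLAM-style convention, not verified here), a minimal parser might look like:

```python
import json

def parse_tool_calls(text):
    """Parse a JSON tool-call list from model output.

    Assumes calls are emitted as a JSON array of
    {"name": ..., "arguments": {...}} objects -- a hypothetical
    format; check the model's chat template for the real one.
    """
    try:
        calls = json.loads(text.strip())
    except json.JSONDecodeError:
        return []
    if isinstance(calls, dict):
        calls = [calls]
    return [c for c in calls if isinstance(c, dict) and "name" in c]

raw = '[{"name": "get_weather", "arguments": {"city": "New York", "unit": "fahrenheit"}}]'
for call in parse_tool_calls(raw):
    print(call["name"], call["arguments"])
```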
## Hardware Requirements
- Full NVFP4 (W4A4): requires an NVIDIA Blackwell GPU (B200, GB200, RTX 5090)
- Weight-only FP4: older GPUs (H100, A100) can load the model, but only weight quantization is applied
- Recommended: 1x B200 (the ~40 GB model fits within a single GPU's 183 GB of memory, leaving ample room for KV cache)
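The ~40 GB figure is easy to sanity-check with back-of-envelope arithmetic: 70B parameters at 4 bits each, plus one 1-byte FP8 scale per 16-weight group (the unquantized `lm_head` and metadata account for the remainder):

```python
# Rough NVFP4 footprint for a 70B dense model (illustrative arithmetic)
params = 70e9
weight_bytes = params * 4 / 8   # 4-bit FP4 weights
scale_bytes = params / 16       # one 1-byte FP8 scale per 16-weight group
total_gb = (weight_bytes + scale_bytes) / 1e9
print(f"~{total_gb:.1f} GB before lm_head and metadata")
```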
## Quantization Details
Quantized using LLM Compressor (v0.10.0):
- Method: Post-training quantization (PTQ) with calibration
- Calibration data: 512 samples from HuggingFaceH4/ultrachat_200k
- Sequence length: 2048 tokens
- Scheme: NVFP4
- Excluded layers: `lm_head`
## Infrastructure
Quantized on an NVIDIA DGX B200 (8x B200, 2 TiB RAM, CUDA 13.0).
## Evaluation

### BFCL v3/v4 Benchmarks (Agentic Function Calling)

Evaluated using the BFCL harness with a vLLM backend on NVIDIA B200 GPUs.
| Category | NVFP4 Accuracy |
|---|---|
| **Single-Turn** | |
| simple_python | 95.25% |
| multiple | 93.00% |
| parallel | 91.00% |
| parallel_multiple | 87.00% |
| simple_java | 64.00% |
| simple_javascript | 76.00% |
| irrelevance | 83.33% |
| **Multi-Turn (BFCL v3)** | |
| multi_turn_base | 84.00% |
| multi_turn_miss_func | 73.00% |
| multi_turn_miss_param | 72.00% |
### OpenLLM v1 Benchmarks

Evaluated using lm-evaluation-harness (v0.4.11) with the vLLM backend (`--apply_chat_template --fewshot_as_multiturn`, `tensor_parallel_size=2`) on NVIDIA B200 GPUs.
| Benchmark | Metric | n-shot | NVFP4 | BF16 Reference | Recovery |
|---|---|---|---|---|---|
| ARC-Challenge | acc_norm | 25 | 68.26 | 68.9 | 99.1% |
| GSM8K | exact_match | 5 | 84.23 | 84.0 | 100.3% |
| HellaSwag | acc_norm | 10 | 83.32 | 86.0 | 96.9% |
| MMLU | acc | 5 | 77.69 | 79.3 | 97.9% |
| TruthfulQA MC2 | acc | 0 | 54.40 | 60.0 | 90.7% |
| Winogrande | acc | 5 | 81.37 | 83.3 | 97.7% |
BF16 reference scores from Open LLM Leaderboard v1 (Llama-3.1-70B-Instruct base). Average recovery: ~97%.
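The Recovery column is simply the quantized score expressed as a percentage of the BF16 reference:

```python
def recovery(quant_score, ref_score):
    """Percent of the BF16 reference score retained after quantization."""
    return round(100 * quant_score / ref_score, 1)

print(recovery(84.23, 84.0))  # GSM8K row -> 100.3
```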
### xLAM-2-70b BF16 Reference Scores

Reference scores from the xLAM model card (note: different evaluation methodology):
| Benchmark | BF16 Score |
|---|---|
| BFCL Overall | #1 on leaderboard |
| τ-bench Overall | 56.2% (vs GPT-4o 52.9%) |
## About Enfuse
Enfuse builds sovereign AI infrastructure for regulated enterprises. The Enfuse platform provides on-prem LLM orchestration and an App Factory for shipping governed, compliant AI applications on your own infrastructure.
This quantization is part of our ongoing work to make large language models more accessible and efficient for on-premise deployment, where memory efficiency directly impacts what models organizations can run within their own data centers.
## Acknowledgments
- Salesforce AI Research for the xLAM-2-70b-fc-r model
- vLLM Project for LLM Compressor
- NVIDIA for the NVFP4 format and Blackwell hardware