Llama-xLAM-2-70b-fc-r-NVFP4

NVFP4-quantized version of Salesforce/Llama-xLAM-2-70b-fc-r, produced by Enfuse.

Model Overview

| Attribute | Value |
|---|---|
| Base Model | Salesforce/Llama-xLAM-2-70b-fc-r |
| Parameters | 70B (dense) |
| Architecture | Dense Transformer (Llama 3.1 based) |
| Specialization | Agentic tool/function calling (#1 on BFCL, #1 on τ-bench) |
| Quantization | NVFP4 (W4A4 with FP4 weights and dynamic FP4 activations) |
| Format | compressed-tensors (safetensors) |
| Precision | FP4 weights (group_size=16), FP8 scales, lm_head unquantized |
| Approx. Size | ~40 GB (down from ~132 GB in BF16) |
| Context Length | 128,000 tokens |
| License | Llama 3.1 Community License |
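
As a sanity check on the "~40 GB" figure, here is a rough back-of-the-envelope size estimate under stated assumptions: ~70e9 weights at 4 bits each, one 1-byte FP8 scale per group of 16 weights, plus an unquantized lm_head sized for a Llama-3.1-70B-shaped model (8192 hidden × 128256 vocab, BF16). The exact on-disk size will differ slightly.

```python
# Rough NVFP4 size estimate (a sketch, not the exact checkpoint size).
GIB = 1024**3

params = 70e9
weight_bytes = params * 0.5          # FP4 = 4 bits = 0.5 bytes per weight
scale_bytes = (params / 16) * 1      # one FP8 (1-byte) scale per group of 16
lm_head_bytes = 8192 * 128256 * 2    # lm_head kept in BF16 (assumed shape)

total_gib = (weight_bytes + scale_bytes + lm_head_bytes) / GIB
bf16_gib = params * 2 / GIB          # everything in BF16 for comparison

print(f"NVFP4 estimate: {total_gib:.1f} GiB (BF16: {bf16_gib:.1f} GiB)")
```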

Why This Model

Salesforce's xLAM-2-70b-fc-r is the #1 open-weight model on the Berkeley Function Calling Leaderboard (BFCL) and τ-bench, outperforming GPT-4o and Claude 3.5 on agentic tool-calling tasks. At 132 GB in BF16, it requires multiple high-end GPUs to serve. This NVFP4 quantization reduces it to ~40 GB, making it deployable on a single B200 GPU while preserving its function-calling capabilities.

How to Use

vLLM (recommended)

```python
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model_id = "enfuse/Llama-xLAM-2-70b-fc-r-NVFP4"
tokenizer = AutoTokenizer.from_pretrained(model_id)

llm = LLM(model=model_id, tensor_parallel_size=1)
sampling_params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=512)

messages = [
    {"role": "user", "content": "You have access to a function called get_weather(city: str, unit: str = 'celsius') that returns current weather. Call it for New York in fahrenheit."},
]

prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
outputs = llm.generate([prompt], sampling_params)
print(outputs[0].outputs[0].text)
```
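
For agentic use, the generated text still has to be turned into structured tool calls. Below is a minimal sketch under the assumption that the model emits a JSON list of `{"name", "arguments"}` objects (the exact output format depends on the chat template, so verify against real generations); `parse_tool_calls` is a hypothetical helper, not part of vLLM or the model.

```python
import json
import re

def parse_tool_calls(text: str):
    """Extract JSON tool calls from generated text (hypothetical helper).

    Assumes the model emits a JSON list of {"name", "arguments"} objects,
    possibly surrounded by prose."""
    match = re.search(r"\[.*\]", text, re.DOTALL)  # grab the bracketed list
    if not match:
        return []
    try:
        calls = json.loads(match.group(0))
    except json.JSONDecodeError:
        return []
    return [c for c in calls if isinstance(c, dict) and "name" in c]

sample = 'Sure. [{"name": "get_weather", "arguments": {"city": "New York", "unit": "fahrenheit"}}]'
for call in parse_tool_calls(sample):
    print(call["name"], call["arguments"])
```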

Hardware Requirements

  • Full NVFP4 (W4A4): requires an NVIDIA Blackwell GPU (B200, GB200, RTX 5090)
  • Weight-only FP4: older GPUs (H100, A100) can load the model but fall back to weight-only quantization; activations stay in higher precision
  • Recommended: 1x B200 (the ~40 GB model fits within a single 183 GB GPU with ample room for KV cache)

Quantization Details

Quantized using LLM Compressor (v0.10.0):

  • Method: Post-training quantization (PTQ) with calibration
  • Calibration data: 512 samples from HuggingFaceH4/ultrachat_200k
  • Sequence length: 2048 tokens
  • Scheme: NVFP4
  • Excluded layers: lm_head
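
To make the scheme concrete, here is a toy numeric sketch of FP4 (E2M1) group quantization: each group of 16 weights shares one scale chosen so the group's largest magnitude maps to the largest FP4 value (6.0). Real NVFP4 additionally quantizes the per-group scales themselves to FP8, which is omitted here for clarity.

```python
# FP4 (E2M1) representable magnitudes; each value also carries a sign bit.
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_group(weights):
    """Quantize-dequantize one group of 16 weights with a shared scale."""
    assert len(weights) == 16, "NVFP4 uses group_size=16"
    scale = max(abs(w) for w in weights) / 6.0 or 1.0  # avoid zero scale
    dequantized = []
    for w in weights:
        # round |w|/scale to the nearest FP4 grid point, then scale back
        mag = min(FP4_GRID, key=lambda g: abs(abs(w) / scale - g))
        dequantized.append(mag * scale * (1 if w >= 0 else -1))
    return dequantized, scale

group = [0.01 * i for i in range(-8, 8)]  # 16 toy weights in [-0.08, 0.07]
deq, scale = quantize_group(group)
max_err = max(abs(a - b) for a, b in zip(group, deq))
print(f"scale={scale:.4f}, max abs error={max_err:.4f}")
```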

Infrastructure

Quantized on an NVIDIA DGX B200 (8x B200, 2 TiB RAM, CUDA 13.0).

Evaluation

BFCL v3/v4 Benchmarks (Agentic Function Calling)

Evaluated using BFCL with vLLM backend on NVIDIA B200 GPUs.

| Category | NVFP4 Accuracy |
|---|---|
| **Single-Turn** | |
| simple_python | 95.25% |
| multiple | 93.00% |
| parallel | 91.00% |
| parallel_multiple | 87.00% |
| simple_java | 64.00% |
| simple_javascript | 76.00% |
| irrelevance | 83.33% |
| **Multi-Turn (BFCL v3)** | |
| multi_turn_base | 84.00% |
| multi_turn_miss_func | 73.00% |
| multi_turn_miss_param | 72.00% |

OpenLLM v1 Benchmarks

Evaluated using lm-evaluation-harness (v0.4.11) with vLLM backend, --apply_chat_template --fewshot_as_multiturn, tensor_parallel_size=2 on NVIDIA B200 GPUs.

| Benchmark | Metric | n-shot | NVFP4 | BF16 Reference | Recovery |
|---|---|---|---|---|---|
| ARC-Challenge | acc_norm | 25 | 68.26 | 68.9 | 99.1% |
| GSM8K | exact_match | 5 | 84.23 | 84.0 | 100.3% |
| HellaSwag | acc_norm | 10 | 83.32 | 86.0 | 96.9% |
| MMLU | acc | 5 | 77.69 | 79.3 | 97.9% |
| TruthfulQA MC2 | acc | 0 | 54.40 | 60.0 | 90.7% |
| Winogrande | acc | 5 | 81.37 | 83.3 | 97.7% |

BF16 reference scores from Open LLM Leaderboard v1 (Llama-3.1-70B-Instruct base). Average recovery: ~97%.
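
The recovery column is simply the ratio of the NVFP4 score to the BF16 reference; a quick check of the average, using the numbers from the table:

```python
# Recovery = 100 * NVFP4 score / BF16 reference score, per benchmark.
nvfp4 = {"ARC-C": 68.26, "GSM8K": 84.23, "HellaSwag": 83.32,
         "MMLU": 77.69, "TruthfulQA": 54.40, "Winogrande": 81.37}
bf16 = {"ARC-C": 68.9, "GSM8K": 84.0, "HellaSwag": 86.0,
        "MMLU": 79.3, "TruthfulQA": 60.0, "Winogrande": 83.3}

recovery = {k: 100 * nvfp4[k] / bf16[k] for k in nvfp4}
avg = sum(recovery.values()) / len(recovery)
print(f"average recovery: {avg:.1f}%")  # ~97%, matching the note above
```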

xLAM-2-70b BF16 Reference Scores

Reference scores from the xLAM model card (different eval methodology):

| Benchmark | BF16 Score |
|---|---|
| BFCL Overall | #1 on leaderboard |
| τ-bench Overall | 56.2% (vs GPT-4o 52.9%) |

About Enfuse

Enfuse builds sovereign AI infrastructure for regulated enterprises. The Enfuse platform provides on-prem LLM orchestration and an App Factory for shipping governed, compliant AI applications on your own infrastructure.

This quantization is part of our ongoing work to make large language models more accessible and efficient for on-premise deployment, where memory efficiency directly impacts what models organizations can run within their own data centers.

Acknowledgments
