---
license: llama3.1
base_model: Salesforce/Llama-xLAM-2-70b-fc-r
tags:
- nvfp4
- quantized
- 4-bit
- blackwell
- xlam
- llama-3.1
- function-calling
- tool-use
- agentic
- compressed-tensors
- vllm
library_name: transformers
pipeline_tag: text-generation
model_type: llama
quantized_by: enfuse
---

# Llama-xLAM-2-70b-fc-r-NVFP4

NVFP4-quantized version of [Salesforce/Llama-xLAM-2-70b-fc-r](https://huggingface.co/Salesforce/Llama-xLAM-2-70b-fc-r), produced by [Enfuse](https://enfuse.io).

## Model Overview

| Attribute | Value |
|-----------|-------|
| Base Model | Salesforce/Llama-xLAM-2-70b-fc-r |
| Parameters | 70B (dense) |
| Architecture | Dense Transformer (Llama 3.1 based) |
| Specialization | Agentic tool/function calling (#1 on BFCL, #1 on τ-bench) |
| Quantization | NVFP4 (W4A4 with FP4 weights and dynamic FP4 activations) |
| Format | `compressed-tensors` (safetensors) |
| Precision | FP4 weights (group_size=16), FP8 scales, lm_head unquantized |
| Approx. Size | ~40 GB (down from ~132 GB in BF16) |
| Context Length | 128,000 tokens |
| License | Llama 3.1 Community License |

## Why This Model

Salesforce's xLAM-2-70b-fc-r is the #1 open-weight model on the [Berkeley Function Calling Leaderboard (BFCL)](https://gorilla.cs.berkeley.edu/leaderboard.html) and [τ-bench](https://github.com/sierra-research/tau-bench), outperforming GPT-4o and Claude 3.5 on agentic tool-calling tasks. At 132 GB in BF16, it requires multiple high-end GPUs to serve. This NVFP4 quantization reduces it to ~40 GB, making it deployable on a single B200 GPU while preserving its function-calling capabilities.
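The ~3.3x size reduction can be sanity-checked with back-of-the-envelope arithmetic. The sketch below is illustrative, not from the model card: the 70.6B parameter count is assumed from the Llama 3.1 70B architecture, and the 0.5-bit-per-weight overhead assumes NVFP4's one FP8 scale per group of 16 weights.

```python
# Rough memory-footprint estimate for BF16 vs. NVFP4 weights.
# Assumptions (not from the model card): ~70.6e9 parameters,
# and one 8-bit scale per 16-weight group for NVFP4.
params = 70.6e9

# BF16: 2 bytes per weight
bf16_bytes = params * 2

# NVFP4: 4 bits per weight + 8 bits of scale per 16 weights
# -> 4 + 8/16 = 4.5 bits per weight on average
nvfp4_bytes = params * (4 + 8 / 16) / 8

GiB = 2 ** 30
print(f"BF16 : {bf16_bytes / GiB:.0f} GiB")   # -> 132 GiB
print(f"NVFP4: {nvfp4_bytes / GiB:.0f} GiB")  # -> 37 GiB
```

The quantized linear layers alone come to roughly 37 GiB; the unquantized `lm_head` and embedding weights account for the remaining few gigabytes of the quoted ~40 GB checkpoint.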
## How to Use

### vLLM (recommended)

```python
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model_id = "enfuse/Llama-xLAM-2-70b-fc-r-NVFP4"
tokenizer = AutoTokenizer.from_pretrained(model_id)

llm = LLM(model=model_id, tensor_parallel_size=1)
sampling_params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=512)

messages = [
    {"role": "user", "content": "You have access to a function called get_weather(city: str, unit: str = 'celsius') that returns current weather. Call it for New York in fahrenheit."},
]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)

outputs = llm.generate([prompt], sampling_params)
print(outputs[0].outputs[0].text)
```

### Hardware Requirements

- **Full NVFP4 (W4A4)**: Requires an NVIDIA Blackwell GPU (B200, GB200, RTX 5090)
- **Weight-only FP4**: Older GPUs (H100, A100) can load the model but will only apply weight quantization
- **Recommended**: 1x B200 (40 GB fits within a single 183 GB GPU with ample room for KV cache)

## Quantization Details

Quantized using [LLM Compressor](https://github.com/vllm-project/llm-compressor) (v0.10.0):

- **Method**: Post-training quantization (PTQ) with calibration
- **Calibration data**: 512 samples from [HuggingFaceH4/ultrachat_200k](https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k)
- **Sequence length**: 2048 tokens
- **Scheme**: `NVFP4`
- **Excluded layers**: `lm_head`

### Infrastructure

Quantized on an NVIDIA DGX B200 (8x B200, 2 TiB RAM, CUDA 13.0).

## Evaluation

### BFCL v3/v4 Benchmarks (Agentic Function Calling)

Evaluated using [BFCL](https://github.com/ShishirPatil/gorilla/tree/main/berkeley-function-call-leaderboard) with the vLLM backend on NVIDIA B200 GPUs.
| Category | NVFP4 Accuracy |
|----------|---------------|
| **Single-Turn** | |
| simple_python | 95.25% |
| multiple | 93.00% |
| parallel | 91.00% |
| parallel_multiple | 87.00% |
| simple_java | 64.00% |
| simple_javascript | 76.00% |
| irrelevance | 83.33% |
| **Multi-Turn (BFCL v3)** | |
| multi_turn_base | 84.00% |
| multi_turn_miss_func | 73.00% |
| multi_turn_miss_param | 72.00% |

### OpenLLM v1 Benchmarks

Evaluated using [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) (v0.4.11) with the vLLM backend, `--apply_chat_template --fewshot_as_multiturn`, and tensor_parallel_size=2 on NVIDIA B200 GPUs.

| Benchmark | Metric | n-shot | NVFP4 | BF16 Reference | Recovery |
|-----------|--------|--------|-------|----------------|----------|
| ARC-Challenge | acc_norm | 25 | 68.26 | 68.9 | 99.1% |
| GSM8K | exact_match | 5 | 84.23 | 84.0 | 100.3% |
| HellaSwag | acc_norm | 10 | 83.32 | 86.0 | 96.9% |
| MMLU | acc | 5 | 77.69 | 79.3 | 97.9% |
| TruthfulQA MC2 | acc | 0 | 54.40 | 60.0 | 90.7% |
| Winogrande | acc | 5 | 81.37 | 83.3 | 97.7% |

BF16 reference scores from [Open LLM Leaderboard v1](https://huggingface.co/datasets/open-llm-leaderboard/details_Salesforce__Llama-xLAM-2-70b-fc-r) (Llama-3.1-70B-Instruct base). Average recovery: ~97%.

### xLAM-2-70b BF16 Reference Scores

Reference scores from the [xLAM model card](https://huggingface.co/Salesforce/Llama-xLAM-2-70b-fc-r) (different eval methodology):

| Benchmark | BF16 Score |
|-----------|-----------|
| BFCL Overall | #1 on leaderboard |
| τ-bench Overall | 56.2% (vs GPT-4o 52.9%) |

## About Enfuse

[Enfuse](https://enfuse.io) builds sovereign AI infrastructure for regulated enterprises. The Enfuse platform provides on-prem LLM orchestration and an App Factory for shipping governed, compliant AI applications on your own infrastructure.
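As a cross-check, the ~97% average recovery quoted in the OpenLLM v1 evaluation above can be reproduced directly from the NVFP4 and BF16 columns of that table:

```python
# Recompute per-benchmark recovery (NVFP4 / BF16 reference) and the
# average, using the scores from the OpenLLM v1 table above.
scores = {
    # benchmark: (NVFP4, BF16 reference)
    "ARC-Challenge": (68.26, 68.9),
    "GSM8K": (84.23, 84.0),
    "HellaSwag": (83.32, 86.0),
    "MMLU": (77.69, 79.3),
    "TruthfulQA MC2": (54.40, 60.0),
    "Winogrande": (81.37, 83.3),
}

recoveries = {name: nvfp4 / bf16 * 100 for name, (nvfp4, bf16) in scores.items()}
for name, rec in recoveries.items():
    print(f"{name:15s} {rec:6.1f}%")

avg = sum(recoveries.values()) / len(recoveries)
print(f"{'Average':15s} {avg:6.1f}%")  # ~97.1%, the "~97%" quoted above
```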
This quantization is part of our ongoing work to make large language models more accessible and efficient for on-premise deployment, where memory efficiency directly impacts what models organizations can run within their own data centers.

## Acknowledgments

- [Salesforce AI Research](https://github.com/SalesforceAIResearch/xLAM) for the xLAM-2-70b-fc-r model
- [vLLM Project](https://github.com/vllm-project/llm-compressor) for LLM Compressor
- [NVIDIA](https://developer.nvidia.com/blog/introducing-nvfp4-for-efficient-and-accurate-low-precision-inference/) for the NVFP4 format and Blackwell hardware