smol-tools-4b: Agentic Tool-Use Model
A 4B parameter text-only model fine-tuned for reliable tool selection, structured JSON output, and knowing when NOT to use tools. Built on Qwen3.5-4B-Claude-4.6-Opus-Reasoning-Distilled, trained with LoRA on 6,855 quality-filtered synthetic examples.
Architecture:
Qwen3_5ForCausalLM (text-only, no vision encoder). Vision weights from the base model have been stripped; this model is purpose-built for text-based tool calling.
Need longer context? See smol-tools-4b-16k (16K context) and smol-tools-4b-32k (32K context) for multi-turn agent workflows.
Available Formats
| Format | Size | Use Case |
|---|---|---|
| BF16 safetensors (this repo) | 9.0 GB | GPU inference with transformers / vLLM |
| Q8_0 GGUF | 4.9 GB | Near-lossless quantization - Jetson Orin NX/AGX, any 8 GB+ GPU |
| Q4_K_M GGUF | 2.9 GB | Edge deployment - Jetson Orin Nano, phones, Raspberry Pi |
GGUF files available in enfuse/smol-tools-4b-GGUF.
Results (200-example held-out eval)
| Metric | Score |
|---|---|
| Tool Selection F1 | 0.955 |
| Tool Precision | 0.955 |
| Tool Recall | 0.980 |
| JSON Validity | 100% |
| Argument Correctness | 100% |
| No-Tool Accuracy | 100% |
Per-Scenario Breakdown
| Scenario | F1 | Count | Description |
|---|---|---|---|
| multi_tool_parallel | 1.000 | 18 | Multiple independent tool calls |
| multi_tool_sequential | 1.000 | 36 | Chained tool calls with dependencies |
| no_tool_needed | 1.000 | 18 | Questions answerable without tools |
| single_tool | 0.981 | 53 | One tool call needed |
| error_recovery | 0.944 | 18 | Handling malformed inputs or missing data |
| reasoning_heavy | 0.914 | 35 | Complex reasoning before tool selection |
| complex_multi_step | 0.818 | 22 | Multi-step workflows with planning |
Capabilities
- Tool selection: Picks the right tool(s) from a provided set with 95.5% F1
- Structured output: Produces valid `<tool_call>{"name": "...", "arguments": {...}}</tool_call>` JSON - 100% validity
- Tool refusal: Correctly answers directly when no tool is needed - 100% accuracy
- Multi-tool: Handles parallel and sequential multi-tool scenarios at F1 = 1.000 on the held-out eval
- Reasoning: Generates chain-of-thought reasoning in `<think>` tags before acting
Available Tools (training set)
The model was trained with these 15 tools but generalizes to new tool schemas provided at inference:
web_search, get_webpage, execute_python, read_file, write_file, list_directory, send_email, get_current_datetime, calculate, translate, get_weather, create_calendar_event, database_query, http_request, shell_command
Quick Start
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "enfuse/smol-tools-4b",  # or local path
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("enfuse/smol-tools-4b", trust_remote_code=True)

tools = [
    {"type": "function", "function": {
        "name": "web_search",
        "description": "Search the web for information",
        "parameters": {"type": "object", "properties": {
            "query": {"type": "string"}
        }, "required": ["query"]}
    }}
]

messages = [
    {"role": "system", "content": "You are a helpful assistant with access to tools."},
    {"role": "user", "content": "What's the latest news about SpaceX?"},
]

prompt = tokenizer.apply_chat_template(messages, tools=tools, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=512, temperature=0.1, do_sample=True)
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=False))
```
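Once the model emits a tool call, an agent loop executes the tool and feeds the result back under the chat template's `tool` role (as in Qwen-style templates) before generating again. A minimal sketch of that message plumbing; the assistant reply is hard-coded here for illustration, and the tool result is a hypothetical placeholder for your real `web_search` backend:

```python
import json

# Messages as built in the Quick Start above.
messages = [
    {"role": "system", "content": "You are a helpful assistant with access to tools."},
    {"role": "user", "content": "What's the latest news about SpaceX?"},
]

# Suppose the model replied with this tool call.
assistant_reply = (
    '<tool_call>\n'
    '{"name": "web_search", "arguments": {"query": "latest SpaceX news"}}\n'
    '</tool_call>'
)
call = json.loads(assistant_reply.split("<tool_call>")[1].split("</tool_call>")[0])

# Hypothetical backend result; wire in your real search implementation here.
tool_result = {"results": ["SpaceX completed another Starship test flight..."]}

# Append the assistant turn and the tool result, then re-apply the chat
# template and call model.generate() again for the final answer.
messages.append({"role": "assistant", "content": assistant_reply})
messages.append({"role": "tool", "content": json.dumps(tool_result)})
```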
With vLLM (faster)
```python
from vllm import LLM, SamplingParams

llm = LLM(model="enfuse/smol-tools-4b", dtype="bfloat16", max_model_len=4096, enforce_eager=True)
sampling = SamplingParams(max_tokens=2048, temperature=0.1, stop=["<|im_end|>"])
outputs = llm.generate([prompt], sampling)
print(outputs[0].outputs[0].text)
```
Output Format
The model responds with optional thinking followed by tool calls or a direct answer:
With tool call:
```
<think>
The user wants to search for SpaceX news. I should use the web_search tool.
</think>
I'll search for the latest SpaceX news for you.
<tool_call>
{"name": "web_search", "arguments": {"query": "latest SpaceX news"}}
</tool_call>
```
Without tool call (direct answer):
```
<think>
This is a general knowledge question I can answer directly without any tools.
</think>
The capital of France is Paris. It has been the capital since...
```
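The tagged format above is easy to consume programmatically. A minimal regex-based parser sketch (a production agent may want stricter validation than this):

```python
import json
import re

def parse_response(text: str) -> dict:
    """Split a model response into optional thinking, tool calls, and prose."""
    think = re.search(r"<think>(.*?)</think>", text, re.DOTALL)
    calls = [json.loads(m) for m in
             re.findall(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", text, re.DOTALL)]
    # Prose is whatever remains after stripping the tagged spans.
    prose = re.sub(r"<think>.*?</think>|<tool_call>.*?</tool_call>", "", text,
                   flags=re.DOTALL).strip()
    return {"thinking": think.group(1).strip() if think else None,
            "tool_calls": calls,
            "answer": prose}
```

A response with no `<tool_call>` block yields an empty `tool_calls` list, so the same function handles the direct-answer case.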
Training Details
| Parameter | Value |
|---|---|
| Base model | Qwen3.5-4B-Claude-4.6-Opus-Reasoning-Distilled |
| Method | LoRA (rank 32, alpha 64) |
| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Training examples | 6,855 (4,578 quality-filtered + 2,277 targeted) |
| Epochs | 3 |
| Batch size | 4 (× 8 gradient accumulation = effective 32) |
| Learning rate | 1e-4 (cosine schedule) |
| Max sequence length | 4,096 |
| Training loss | 0.160 |
| Token accuracy | 95.7% |
| Training time | ~5.3 hours on 1× NVIDIA H200 |
| Framework | TRL SFTTrainer + PEFT |
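The batch arithmetic and cosine decay in the table can be sanity-checked numerically. A small sketch (warmup is omitted because the card doesn't list one, and the step count is derived from the table's own numbers):

```python
import math

per_device_batch, grad_accum = 4, 8
effective_batch = per_device_batch * grad_accum      # 32, as listed in the table

steps_per_epoch = math.ceil(6855 / effective_batch)  # 6,855 training examples
total_steps = steps_per_epoch * 3                    # 3 epochs

def cosine_lr(step: int, total: int, peak: float = 1e-4, floor: float = 0.0) -> float:
    """Cosine decay from peak to floor over `total` steps (no warmup)."""
    return floor + 0.5 * (peak - floor) * (1 + math.cos(math.pi * step / total))
```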
Data Pipeline
- Teacher model: Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled generated synthetic tool-use conversations
- Quality filtering: Removed examples with malformed JSON, missing tool calls, or incorrect tool usage (5,000 → 4,578)
- Targeted generation: Generated 2,277 additional examples focusing on `reasoning_heavy` and `complex_multi_step` scenarios with explicit `<think>` tag prompting
- Combined dataset: 6,855 examples across 7 scenario types
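A quality filter of the kind described can be sketched as a per-example check on the tool-call spans (a simplified illustration, not the actual pipeline code; the real filter also checked argument correctness against the tool schemas):

```python
import json
import re

TOOL_CALL_RE = re.compile(r"<tool_call>\s*(.*?)\s*</tool_call>", re.DOTALL)

def is_clean(example: str, known_tools: set) -> bool:
    """Reject examples with missing tool calls, malformed JSON, or unknown tools."""
    calls = TOOL_CALL_RE.findall(example)
    if not calls:
        return False  # this sketch assumes the example was supposed to call a tool
    for raw in calls:
        try:
            call = json.loads(raw)
        except json.JSONDecodeError:
            return False  # malformed JSON
        if call.get("name") not in known_tools or "arguments" not in call:
            return False  # unknown tool or missing arguments
    return True
```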
What Worked (Experiment Log)
| Experiment | F1 | Key Finding |
|---|---|---|
| Base model (no training) | 0.888 | Strong baseline from Claude distillation |
| R1: 5K unfiltered data | 0.913 | Fine-tuning helps |
| R2: 15K unfiltered data | 0.905 | More dirty data hurts |
| R3: 4.6K filtered data | 0.950 | Data quality > quantity |
| R3: 6.9K filtered + targeted | 0.955 | Targeted reasoning data helps |
| R4: 13.6K all-clean data | 0.935 | Too much data overfits |
| R4: 5 epochs | 0.920 | More epochs overfits |
| R5: Higher LoRA rank (64) | 0.930 | Rank 32 is sufficient |
| R5: Lower LR (5e-5) | 0.910 | 1e-4 is optimal |
smol-tools Family
All models share the same base architecture, tool schema, and output format. Choose based on your context length needs:
| Model | Context | Tool F1 | JSON Valid | No-Tool Acc | Parameters | HF Repo |
|---|---|---|---|---|---|---|
| smol-tools-4b | 4K | 0.955 | 100% | 100% | Rank 32, α=64 | this repo |
| smol-tools-4b-16k | 16K | 0.948 | 100% | 100% | Rank 64, α=128 | enfuse/smol-tools-4b-16k |
| smol-tools-4b-32k | 32K | 0.940 | 100% | 100% | Rank 64, α=128 | enfuse/smol-tools-4b-32k |
How to choose:
- 4K (this model): Single-turn tool calls, short tool outputs - highest accuracy, lowest memory
- 16K: Multi-turn conversations (5-10 rounds), moderate tool outputs - also available in quantized GGUF formats
- 32K: Extended agent sessions (10-20 rounds), large tool outputs - also available in quantized GGUF formats
When to Use This Model
- You're building an agent or copilot on the edge - local devices, Jetson, phones, on-prem servers with limited GPU
- You need thousands of tool-calling inferences per minute cheaply - a 4B model serves 10-50× faster than a 70B at a fraction of the cost
- You need structured output you can trust - 100% JSON validity means no crashed pipelines from malformed tool calls
- You're tired of paying per-token API costs for tool use that a small local model can handle
When NOT to Use This Model
- If your agent needs multi-turn conversations or long tool outputs, use smol-tools-4b-16k or smol-tools-4b-32k instead
- If you need GPT-4-level complex multi-step planning (our weakest category at F1=0.818), use a bigger model
- If latency and cost don't matter, just call a frontier API β they'll outperform any 4B model on hard reasoning
- If your use case requires tools not seen during training, test carefully β the model generalizes to new tool schemas but hasn't been validated on every possible tool type
Limitations
- `complex_multi_step` scenarios (F1=0.818) remain the weakest: the model sometimes struggles with multi-step planning involving 3+ chained tools
- Thinking rate in evaluation was 0%: the model reasons, but doesn't always emit explicit `<think>` tags at low temperature
- Trained on synthetic data only: real-world tool-use patterns may differ
- Inherits Qwen3.5-4B base model limitations (context window, knowledge cutoff)
Hardware
- Training: 1× NVIDIA H200 NVL (141 GB HBM3e)
- Inference (BF16): Any GPU with ≥10 GB VRAM
- Inference (Q8_0 GGUF): Any device with ≥6 GB RAM - Jetson Orin NX, consumer GPUs
- Inference (Q4_K_M GGUF): Any device with ≥4 GB RAM - Jetson Orin Nano, phones, Raspberry Pi 5
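The memory floors above follow from a weights-only rule of thumb: parameter count × bits per weight, ignoring KV cache and runtime overhead. A sketch, where the ~4.5B effective parameter count is an assumption chosen to match the 9.0 GB BF16 figure, not an official number:

```python
def weight_gb(params_billions: float, bits_per_weight: float) -> float:
    """Weights-only footprint in GB: params * bits / 8, no KV cache or overhead."""
    return params_billions * bits_per_weight / 8

bf16 = weight_gb(4.5, 16)  # ~9.0 GB, matching the BF16 safetensors size
q8 = weight_gb(4.5, 8)     # ~4.5 GB; Q8_0's per-block scales push it toward 4.9 GB
```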
Attribution
- Base model: Qwen3.5-4B-Claude-4.6-Opus-Reasoning-Distilled by Jackrong
- Training framework: TRL + PEFT by HuggingFace
- Inference: vLLM