smol-tools-4b: Agentic Tool-Use Model

A 4B parameter text-only model fine-tuned for reliable tool selection, structured JSON output, and knowing when NOT to use tools. Built on Qwen3.5-4B-Claude-4.6-Opus-Reasoning-Distilled, trained with LoRA on 6,855 quality-filtered synthetic examples.

Architecture: Qwen3_5ForCausalLM (text-only, no vision encoder). Vision weights from the base model have been stripped; this model is purpose-built for text-based tool calling.

Need longer context? See smol-tools-4b-16k (16K context) and smol-tools-4b-32k (32K context) for multi-turn agent workflows.

Available Formats

| Format | Size | Use Case |
|---|---|---|
| BF16 safetensors (this repo) | 9.0 GB | GPU inference with transformers / vLLM |
| Q8_0 GGUF | 4.9 GB | Near-lossless quantization; Jetson Orin NX/AGX, any 8 GB+ GPU |
| Q4_K_M GGUF | 2.9 GB | Edge deployment; Jetson Orin Nano, phones, Raspberry Pi |

GGUF files available in enfuse/smol-tools-4b-GGUF.

Results (200-example held-out eval)

| Metric | Score |
|---|---|
| Tool Selection F1 | 0.955 |
| Tool Precision | 0.955 |
| Tool Recall | 0.980 |
| JSON Validity | 100% |
| Argument Correctness | 100% |
| No-Tool Accuracy | 100% |

Per-Scenario Breakdown

| Scenario | F1 | Count | Description |
|---|---|---|---|
| multi_tool_parallel | 1.000 | 18 | Multiple independent tool calls |
| multi_tool_sequential | 1.000 | 36 | Chained tool calls with dependencies |
| no_tool_needed | 1.000 | 18 | Questions answerable without tools |
| single_tool | 0.981 | 53 | One tool call needed |
| error_recovery | 0.944 | 18 | Handling malformed inputs or missing data |
| reasoning_heavy | 0.914 | 35 | Complex reasoning before tool selection |
| complex_multi_step | 0.818 | 22 | Multi-step workflows with planning |

Capabilities

  • Tool selection: Picks the right tool(s) from a provided set with 95.5% F1
  • Structured output: Produces valid <tool_call>{"name": "...", "arguments": {...}}</tool_call> JSON (100% validity)
  • Tool refusal: Correctly answers directly when no tool is needed (100% accuracy)
  • Multi-tool: Handles parallel and sequential multi-tool scenarios perfectly
  • Reasoning: Generates chain-of-thought reasoning in <think> tags before acting

Available Tools (training set)

The model was trained with these 15 tools but generalizes to new tool schemas provided at inference:

web_search, get_webpage, execute_python, read_file, write_file, list_directory, send_email, get_current_datetime, calculate, translate, get_weather, create_calendar_event, database_query, http_request, shell_command

Quick Start

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "enfuse/smol-tools-4b",  # or local path
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("enfuse/smol-tools-4b", trust_remote_code=True)

tools = [
    {"type": "function", "function": {
        "name": "web_search",
        "description": "Search the web for information",
        "parameters": {"type": "object", "properties": {
            "query": {"type": "string"}
        }, "required": ["query"]}
    }}
]

messages = [
    {"role": "system", "content": "You are a helpful assistant with access to tools."},
    {"role": "user", "content": "What's the latest news about SpaceX?"},
]

prompt = tokenizer.apply_chat_template(messages, tools=tools, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=512, temperature=0.1, do_sample=True)
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=False))
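Once decoded, the `<tool_call>` block can be extracted and parsed with a few lines of standard-library code. A minimal sketch (the `response` string below is a stand-in for the decoded model output, not a captured generation):

```python
import json
import re

def parse_tool_calls(response: str):
    """Extract and parse every <tool_call>...</tool_call> JSON block."""
    pattern = r"<tool_call>\s*(\{.*?\})\s*</tool_call>"
    calls = []
    for raw in re.findall(pattern, response, flags=re.DOTALL):
        calls.append(json.loads(raw))  # raises ValueError on malformed JSON
    return calls

# Stand-in for tokenizer.decode(...) output:
response = (
    "<think>\nThe user wants SpaceX news.\n</think>\n\n"
    "I'll search for that.\n\n"
    '<tool_call>\n{"name": "web_search", "arguments": {"query": "latest SpaceX news"}}\n</tool_call>'
)
calls = parse_tool_calls(response)
print(calls[0]["name"], calls[0]["arguments"])  # web_search {'query': 'latest SpaceX news'}
```

Because JSON validity is 100% on the eval set, `json.loads` rarely throws in practice, but wrapping it in a try/except is still cheap insurance in production.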

With vLLM (faster)

from vllm import LLM, SamplingParams

llm = LLM(model="enfuse/smol-tools-4b", dtype="bfloat16", max_model_len=4096, enforce_eager=True)
sampling = SamplingParams(max_tokens=2048, temperature=0.1, stop=["<|im_end|>"])
outputs = llm.generate([prompt], sampling)  # reuses the chat-templated `prompt` built above
print(outputs[0].outputs[0].text)

Output Format

The model responds with optional thinking followed by tool calls or a direct answer:

With tool call:

<think>
The user wants to search for SpaceX news. I should use the web_search tool.
</think>

I'll search for the latest SpaceX news for you.

<tool_call>
{"name": "web_search", "arguments": {"query": "latest SpaceX news"}}
</tool_call>

Without tool call (direct answer):

<think>
This is a general knowledge question I can answer directly without any tools.
</think>

The capital of France is Paris. It has been the capital since...
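The two response shapes above suggest a simple agent loop: parse the `<tool_call>` if present, execute it, and feed the result back; otherwise return the direct answer. A sketch with a stubbed `generate` function and an illustrative tool registry (none of these names are part of the model's API):

```python
import json
import re

TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)

# Illustrative registry; real handlers would hit the network, filesystem, etc.
TOOLS = {
    "web_search": lambda query: f"[results for {query!r}]",
}

def generate(messages):
    """Stub for the model call; a real loop would run model.generate() here."""
    return ('<tool_call>\n'
            '{"name": "web_search", "arguments": {"query": "latest SpaceX news"}}\n'
            '</tool_call>')

def run_turn(messages):
    response = generate(messages)
    match = TOOL_CALL_RE.search(response)
    if match is None:
        return response  # direct answer: no tool needed
    call = json.loads(match.group(1))
    result = TOOLS[call["name"]](**call["arguments"])
    # Append the tool result so the next generate() call can see it:
    messages.append({"role": "tool", "content": result})
    return result

messages = [{"role": "user", "content": "What's the latest news about SpaceX?"}]
print(run_turn(messages))  # [results for 'latest SpaceX news']
```

For multi-round loops like this, prefer the 16K or 32K variants: tool results accumulate in the message history and quickly exhaust a 4K window.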

Training Details

| Parameter | Value |
|---|---|
| Base model | Qwen3.5-4B-Claude-4.6-Opus-Reasoning-Distilled |
| Method | LoRA (rank 32, alpha 64) |
| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Training examples | 6,855 (4,578 quality-filtered + 2,277 targeted) |
| Epochs | 3 |
| Batch size | 4 (× 8 gradient accumulation = effective 32) |
| Learning rate | 1e-4 (cosine schedule) |
| Max sequence length | 4,096 |
| Training loss | 0.160 |
| Token accuracy | 95.7% |
| Training time | ~5.3 hours on 1× NVIDIA H200 |
| Framework | TRL SFTTrainer + PEFT |
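The table above corresponds roughly to the following TRL + PEFT configuration. This is a hedged sketch, not the actual training script: dataset loading is omitted, and some `SFTConfig` argument names vary across TRL versions.

```python
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

lora = LoraConfig(
    r=32,                       # LoRA rank
    lora_alpha=64,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)
args = SFTConfig(
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,   # effective batch size 32
    learning_rate=1e-4,
    lr_scheduler_type="cosine",
    max_seq_length=4096,             # renamed in some TRL releases
    bf16=True,
)
# trainer = SFTTrainer(model=model, args=args, train_dataset=dataset, peft_config=lora)
# trainer.train()
```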

Data Pipeline

  1. Teacher model: Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled generated synthetic tool-use conversations
  2. Quality filtering: Removed examples with malformed JSON, missing tool calls, or incorrect tool usage (5,000 → 4,578)
  3. Targeted generation: Generated 2,277 additional examples focusing on reasoning_heavy and complex_multi_step scenarios with explicit <think> tag prompting
  4. Combined dataset: 6,855 examples across 7 scenario types
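The malformed-JSON part of the step-2 filter can be sketched as a simple validity check. Field names (`messages`, `role`, `content`) follow the common chat-dataset convention and are illustrative, not the actual pipeline:

```python
import json
import re

TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)

def is_clean(example: dict) -> bool:
    """Keep an example only if every tool call in its assistant turns
    parses as JSON and carries the required name/arguments fields."""
    for turn in example["messages"]:
        if turn["role"] != "assistant":
            continue
        for raw in TOOL_CALL_RE.findall(turn["content"]):
            try:
                call = json.loads(raw)
            except json.JSONDecodeError:
                return False  # malformed JSON
            if "name" not in call or "arguments" not in call:
                return False  # missing required fields
    return True

good = {"messages": [{"role": "assistant", "content":
        '<tool_call>\n{"name": "calculate", "arguments": {"expression": "2+2"}}\n</tool_call>'}]}
bad = {"messages": [{"role": "assistant", "content":
       "<tool_call>\n{not json}\n</tool_call>"}]}
print(is_clean(good), is_clean(bad))  # True False
```

The real filter also checked for missing tool calls and incorrect tool usage, which requires comparing against the example's expected labels.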

What Worked (Experiment Log)

| Experiment | F1 | Key Finding |
|---|---|---|
| Base model (no training) | 0.888 | Strong baseline from Claude distillation |
| R1: 5K unfiltered examples | 0.913 | Fine-tuning helps |
| R2: 15K unfiltered examples | 0.905 | More dirty data hurts |
| R3: 4.6K filtered examples | 0.950 | Data quality beats quantity |
| R3: 6.9K filtered + targeted | 0.955 | Targeted reasoning data helps |
| R4: 13.6K all-clean examples | 0.935 | Too much data overfits |
| R4: 5 epochs | 0.920 | More epochs overfit |
| R5: Higher LoRA rank (64) | 0.930 | Rank 32 is sufficient |
| R5: Lower LR (5e-5) | 0.910 | 1e-4 is optimal |

smol-tools Family

All models share the same base architecture, tool schema, and output format. Choose based on your context length needs:

| Model | Context | Tool F1 | JSON Valid | No-Tool Acc | LoRA Config | HF Repo |
|---|---|---|---|---|---|---|
| smol-tools-4b | 4K | 0.955 | 100% | 100% | Rank 32, α=64 | this repo |
| smol-tools-4b-16k | 16K | 0.948 | 100% | 100% | Rank 64, α=128 | enfuse/smol-tools-4b-16k |
| smol-tools-4b-32k | 32K | 0.940 | 100% | 100% | Rank 64, α=128 | enfuse/smol-tools-4b-32k |

How to choose:

  • 4K (this model): Single-turn tool calls, short tool outputs; highest accuracy, lowest memory
  • 16K: Multi-turn conversations (5-10 rounds), moderate tool outputs; also available in GGUF quantized formats
  • 32K: Extended agent sessions (10-20 rounds), large tool outputs; also available in GGUF quantized formats

When to Use This Model

  • You're building an agent or copilot on the edge: local devices, Jetson, phones, on-prem servers with limited GPU
  • You need thousands of tool-calling inferences per minute cheaply: a 4B model serves 10–50x faster than a 70B at a fraction of the cost
  • You need structured output you can trust: 100% JSON validity means no crashed pipelines from malformed tool calls
  • You're tired of paying per-token API costs for tool use that a small local model can handle

When NOT to Use This Model

  • If your agent needs multi-turn conversations or long tool outputs, use smol-tools-4b-16k or smol-tools-4b-32k instead
  • If you need GPT-4-level complex multi-step planning (our weakest category at F1=0.818), use a bigger model
  • If latency and cost don't matter, just call a frontier API; it will outperform any 4B model on hard reasoning
  • If your use case requires tools not seen during training, test carefully: the model generalizes to new tool schemas but hasn't been validated on every possible tool type

Limitations

  • complex_multi_step scenarios (F1=0.818) remain the weakest: the model sometimes struggles with multi-step planning involving 3+ chained tools
  • The <think>-tag rate measured in evaluation was 0%: the model reasons, but at low temperature it rarely emits explicit <think> tags
  • Trained on synthetic data only; real-world tool-use patterns may differ
  • Inherits the Qwen3.5-4B base model's limitations (context window, knowledge cutoff)

Hardware

  • Training: 1× NVIDIA H200 NVL (141 GB HBM3e)
  • Inference (BF16): Any GPU with ≥10 GB VRAM
  • Inference (Q8_0 GGUF): Any device with ≥6 GB RAM (Jetson Orin NX, consumer GPUs)
  • Inference (Q4_K_M GGUF): Any device with ≥4 GB RAM (Jetson Orin Nano, phones, Raspberry Pi 5)
