smol-tools-4b: Agentic Tool-Use Model
A 4B parameter text-only model fine-tuned for reliable tool selection, structured JSON output, and knowing when NOT to use tools. Built on Qwen3.5-4B-Claude-4.6-Opus-Reasoning-Distilled, trained with LoRA on 6,855 quality-filtered synthetic examples.
Architecture:
Qwen3_5ForCausalLM (text-only, no vision encoder). Vision weights from the base model have been stripped; this model is purpose-built for text-based tool calling.
Need longer context? See smol-tools-4b-16k (16K context) and smol-tools-4b-32k (32K context) for multi-turn agent workflows.
Available Formats
| Format | Size | Use Case |
|---|---|---|
| BF16 safetensors (this repo) | 9.0 GB | GPU inference with transformers / vLLM |
| Q8_0 GGUF | 4.9 GB | Near-lossless quantization - Jetson Orin NX/AGX, any 8 GB+ GPU |
| Q4_K_M GGUF | 2.9 GB | Edge deployment - Jetson Orin Nano, phones, Raspberry Pi |
GGUF files available in enfuse/smol-tools-4b-GGUF.
Results (200-example held-out eval)
| Metric | Score |
|---|---|
| Tool Selection F1 | 0.955 |
| Tool Precision | 0.955 |
| Tool Recall | 0.980 |
| JSON Validity | 100% |
| Argument Correctness | 100% |
| No-Tool Accuracy | 100% |
Per-Scenario Breakdown
| Scenario | F1 | Count | Description |
|---|---|---|---|
| multi_tool_parallel | 1.000 | 18 | Multiple independent tool calls |
| multi_tool_sequential | 1.000 | 36 | Chained tool calls with dependencies |
| no_tool_needed | 1.000 | 18 | Questions answerable without tools |
| single_tool | 0.981 | 53 | One tool call needed |
| error_recovery | 0.944 | 18 | Handling malformed inputs or missing data |
| reasoning_heavy | 0.914 | 35 | Complex reasoning before tool selection |
| complex_multi_step | 0.818 | 22 | Multi-step workflows with planning |
Capabilities
- Tool selection: Picks the right tool(s) from a provided set with 95.5% F1
- Structured output: Produces valid `<tool_call>{"name": "...", "arguments": {...}}</tool_call>` JSON - 100% validity
- Tool refusal: Correctly answers directly when no tool is needed - 100% accuracy
- Multi-tool: Handles parallel and sequential multi-tool scenarios at F1 = 1.000 on the held-out eval
- Reasoning: Generates chain-of-thought reasoning in `<think>` tags before acting
Available Tools (training set)
The model was trained with these 15 tools but generalizes to new tool schemas provided at inference:
web_search, get_webpage, execute_python, read_file, write_file, list_directory, send_email, get_current_datetime, calculate, translate, get_weather, create_calendar_event, database_query, http_request, shell_command
Quick Start
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "enfuse/smol-tools-4b",  # or local path
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("enfuse/smol-tools-4b", trust_remote_code=True)

tools = [
    {"type": "function", "function": {
        "name": "web_search",
        "description": "Search the web for information",
        "parameters": {"type": "object", "properties": {
            "query": {"type": "string"}
        }, "required": ["query"]}
    }}
]

messages = [
    {"role": "system", "content": "You are a helpful assistant with access to tools."},
    {"role": "user", "content": "What's the latest news about SpaceX?"},
]

prompt = tokenizer.apply_chat_template(messages, tools=tools, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=512, temperature=0.1, do_sample=True)
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=False))
```
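Once the model emits a tool call, an agent loop executes the tool and feeds the result back under the chat template's `tool` role (as in Qwen-style templates) before generating again. A minimal sketch of that message plumbing; the assistant reply is hard-coded here for illustration, and the tool result is a hypothetical placeholder for your real `web_search` backend:

```python
import json

# Messages as built in the Quick Start above.
messages = [
    {"role": "system", "content": "You are a helpful assistant with access to tools."},
    {"role": "user", "content": "What's the latest news about SpaceX?"},
]

# Suppose the model replied with this tool call.
assistant_reply = (
    '<tool_call>\n'
    '{"name": "web_search", "arguments": {"query": "latest SpaceX news"}}\n'
    '</tool_call>'
)
call = json.loads(assistant_reply.split("<tool_call>")[1].split("</tool_call>")[0])

# Hypothetical backend result; wire in your real search implementation here.
tool_result = {"results": ["SpaceX completed another Starship test flight..."]}

# Append the assistant turn and the tool result, then re-apply the chat
# template and call model.generate() again for the final answer.
messages.append({"role": "assistant", "content": assistant_reply})
messages.append({"role": "tool", "content": json.dumps(tool_result)})
```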
With vLLM (faster)
```python
from vllm import LLM, SamplingParams

llm = LLM(model="enfuse/smol-tools-4b", dtype="bfloat16", max_model_len=4096, enforce_eager=True)
sampling = SamplingParams(max_tokens=2048, temperature=0.1, stop=["<|im_end|>"])
outputs = llm.generate([prompt], sampling)
print(outputs[0].outputs[0].text)
```
Output Format
The model responds with optional thinking followed by tool calls or a direct answer:
With tool call:
```
<think>
The user wants to search for SpaceX news. I should use the web_search tool.
</think>
I'll search for the latest SpaceX news for you.
<tool_call>
{"name": "web_search", "arguments": {"query": "latest SpaceX news"}}
</tool_call>
```
Without tool call (direct answer):
```
<think>
This is a general knowledge question I can answer directly without any tools.
</think>
The capital of France is Paris. It has been the capital since...
```
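The tagged format above is easy to consume programmatically. A minimal regex-based parser sketch (a production agent may want stricter validation than this):

```python
import json
import re

def parse_response(text: str) -> dict:
    """Split a model response into optional thinking, tool calls, and prose."""
    think = re.search(r"<think>(.*?)</think>", text, re.DOTALL)
    calls = [json.loads(m) for m in
             re.findall(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", text, re.DOTALL)]
    # Prose is whatever remains after stripping the tagged spans.
    prose = re.sub(r"<think>.*?</think>|<tool_call>.*?</tool_call>", "", text,
                   flags=re.DOTALL).strip()
    return {"thinking": think.group(1).strip() if think else None,
            "tool_calls": calls,
            "answer": prose}
```

A response with no `<tool_call>` block yields an empty `tool_calls` list, so the same function handles the direct-answer case.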
Training Details
| Parameter | Value |
|---|---|
| Base model | Qwen3.5-4B-Claude-4.6-Opus-Reasoning-Distilled |
| Method | LoRA (rank 32, alpha 64) |
| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Training examples | 6,855 (4,578 quality-filtered + 2,277 targeted) |
| Epochs | 3 |
| Batch size | 4 (× 8 gradient accumulation = effective 32) |
| Learning rate | 1e-4 (cosine schedule) |
| Max sequence length | 4,096 |
| Training loss | 0.160 |
| Token accuracy | 95.7% |
| Training time | ~5.3 hours on 1× NVIDIA H200 |
| Framework | TRL SFTTrainer + PEFT |
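The batch arithmetic and cosine decay in the table can be sanity-checked numerically. A small sketch (warmup is omitted because the card doesn't list one, and the step count is derived from the table's own numbers):

```python
import math

per_device_batch, grad_accum = 4, 8
effective_batch = per_device_batch * grad_accum      # 32, as listed in the table

steps_per_epoch = math.ceil(6855 / effective_batch)  # 6,855 training examples
total_steps = steps_per_epoch * 3                    # 3 epochs

def cosine_lr(step: int, total: int, peak: float = 1e-4, floor: float = 0.0) -> float:
    """Cosine decay from peak to floor over `total` steps (no warmup)."""
    return floor + 0.5 * (peak - floor) * (1 + math.cos(math.pi * step / total))
```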
Data Pipeline
- Teacher model: Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled generated synthetic tool-use conversations
- Quality filtering: Removed examples with malformed JSON, missing tool calls, or incorrect tool usage (5,000 → 4,578)
- Targeted generation: Generated 2,277 additional examples focusing on `reasoning_heavy` and `complex_multi_step` scenarios with explicit `<think>` tag prompting
- Combined dataset: 6,855 examples across 7 scenario types
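A quality filter of the kind described can be sketched as a per-example check on the tool-call spans (a simplified illustration, not the actual pipeline code; the real filter also checked argument correctness against the tool schemas):

```python
import json
import re

TOOL_CALL_RE = re.compile(r"<tool_call>\s*(.*?)\s*</tool_call>", re.DOTALL)

def is_clean(example: str, known_tools: set) -> bool:
    """Reject examples with missing tool calls, malformed JSON, or unknown tools."""
    calls = TOOL_CALL_RE.findall(example)
    if not calls:
        return False  # this sketch assumes the example was supposed to call a tool
    for raw in calls:
        try:
            call = json.loads(raw)
        except json.JSONDecodeError:
            return False  # malformed JSON
        if call.get("name") not in known_tools or "arguments" not in call:
            return False  # unknown tool or missing arguments
    return True
```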
What Worked (Experiment Log)
| Experiment | F1 | Key Finding |
|---|---|---|
| Base model (no training) | 0.888 | Strong baseline from Claude distillation |
| R1: 5K unfiltered data | 0.913 | Fine-tuning helps |
| R2: 15K unfiltered data | 0.905 | More dirty data hurts |
| R3: 4.6K filtered data | 0.950 | Data quality > quantity |
| R3: 6.9K filtered + targeted | 0.955 | Targeted reasoning data helps |
| R4: 13.6K all-clean data | 0.935 | Too much data overfits |
| R4: 5 epochs | 0.920 | More epochs overfits |
| R5: Higher LoRA rank (64) | 0.930 | Rank 32 is sufficient |
| R5: Lower LR (5e-5) | 0.910 | 1e-4 is optimal |
smol-tools Family
All models share the same base architecture, tool schema, and output format. Choose based on your context length needs:
| Model | Context | Tool F1 | JSON Valid | No-Tool Acc | Parameters | HF Repo |
|---|---|---|---|---|---|---|
| smol-tools-4b | 4K | 0.955 | 100% | 100% | Rank 32, α=64 | this repo |
| smol-tools-4b-16k | 16K | 0.948 | 100% | 100% | Rank 64, α=128 | enfuse/smol-tools-4b-16k |
| smol-tools-4b-32k | 32K | 0.940 | 100% | 100% | Rank 64, α=128 | enfuse/smol-tools-4b-32k |
How to choose:
- 4K (this model): Single-turn tool calls, short tool outputs - highest accuracy, lowest memory
- 16K: Multi-turn conversations (5-10 rounds), moderate tool outputs - also available in quantized GGUF formats
- 32K: Extended agent sessions (10-20 rounds), large tool outputs - also available in quantized GGUF formats
When to Use This Model
- You're building an agent or copilot on the edge - local devices, Jetson, phones, on-prem servers with limited GPU
- You need thousands of tool-calling inferences per minute cheaply - a 4B model serves 10-50× faster than a 70B at a fraction of the cost
- You need structured output you can trust - 100% JSON validity means no crashed pipelines from malformed tool calls
- You're tired of paying per-token API costs for tool use that a small local model can handle
When NOT to Use This Model
- If your agent needs multi-turn conversations or long tool outputs, use smol-tools-4b-16k or smol-tools-4b-32k instead
- If you need GPT-4-level complex multi-step planning (our weakest category at F1=0.818), use a bigger model
- If latency and cost don't matter, just call a frontier API β they'll outperform any 4B model on hard reasoning
- If your use case requires tools not seen during training, test carefully β the model generalizes to new tool schemas but hasn't been validated on every possible tool type
Limitations
- `complex_multi_step` scenarios (F1=0.818) remain the weakest: the model sometimes struggles with multi-step planning involving 3+ chained tools
- Thinking rate in evaluation was 0%: the model reasons, but doesn't always emit explicit `<think>` tags at low temperature
- Trained on synthetic data only: real-world tool-use patterns may differ
- Inherits Qwen3.5-4B base model limitations (context window, knowledge cutoff)
Hardware
- Training: 1× NVIDIA H200 NVL (141 GB HBM3e)
- Inference (BF16): Any GPU with ≥10 GB VRAM
- Inference (Q8_0 GGUF): Any device with ≥6 GB RAM - Jetson Orin NX, consumer GPUs
- Inference (Q4_K_M GGUF): Any device with ≥4 GB RAM - Jetson Orin Nano, phones, Raspberry Pi 5
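The memory floors above follow from a weights-only rule of thumb: parameter count × bits per weight, ignoring KV cache and runtime overhead. A sketch, where the ~4.5B effective parameter count is an assumption chosen to match the 9.0 GB BF16 figure, not an official number:

```python
def weight_gb(params_billions: float, bits_per_weight: float) -> float:
    """Weights-only footprint in GB: params * bits / 8, no KV cache or overhead."""
    return params_billions * bits_per_weight / 8

bf16 = weight_gb(4.5, 16)  # ~9.0 GB, matching the BF16 safetensors size
q8 = weight_gb(4.5, 8)     # ~4.5 GB; Q8_0's per-block scales push it toward 4.9 GB
```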
Attribution
- Base model: Qwen3.5-4B-Claude-4.6-Opus-Reasoning-Distilled by Jackrong
- Training framework: TRL + PEFT by HuggingFace
- Inference: vLLM