smol-tools-4b-32k — Long-Context Agentic Tool-Use Model
A 4B parameter model fine-tuned for reliable tool calling with 32K context support. Handles extended multi-turn tool-use conversations, long document analysis with tool calls, and complex multi-step agent workflows — all within a single context window 8x larger than the original smol-tools-4b.
Built on Qwen3.5-4B-Claude-4.6-Opus-Reasoning-Distilled, trained with LoRA on 8,059 examples including multi-turn tool-use conversations up to 32K tokens.
Architecture:
Qwen3_5ForCausalLM (text-only, 32 layers, hybrid attention: 24 linear-attention + 8 full-attention layers). Qwen3.5's efficient attention keeps long-context inference memory-friendly.
Need less context? See smol-tools-4b-16k (16K, also available in GGUF) or smol-tools-4b (4K, highest accuracy, GGUF also available).
Available Formats
| Format | File | Size | Tool F1 | Use Case |
|---|---|---|---|---|
| BF16 safetensors | model.safetensors | 9.7 GB | 0.940 | GPU inference with transformers / vLLM |
| Q8_0 GGUF | smol-tools-4b-32k-q8_0.gguf | 4.9 GB | 0.918 | Near-lossless: Jetson Orin NX/AGX, 8 GB+ GPUs |
| Q4_K_M GGUF | smol-tools-4b-32k-q4_k_m.gguf | 2.9 GB | 0.925 | Edge deployment: Jetson Orin Nano, phones, RPi 5 |
All formats maintain 100% JSON validity, 100% argument correctness, and 100% no-tool accuracy. GGUF files run with llama.cpp, ollama, or llama-cpp-python.
Why 32K Context?
The original smol-tools-4b (4K context) works well for single-turn tool calls. But real agent workflows often involve:
- Multi-turn conversations — 10-20 rounds of tool calls and results accumulating in context
- Long tool outputs — database queries returning hundreds of rows, full file contents, and lengthy web pages
- Complex planning — reasoning over many prior results to decide next steps
This model was specifically trained on multi-turn tool-use data (up to 32K tokens) so it maintains tool-calling accuracy across very long conversations, not just short single-turn queries. 32K tokens is enough for most real-world agent sessions.
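As a rough illustration of why the larger window matters (the per-round token counts below are assumptions for the sketch, not measurements from the training data):

```python
# Back-of-envelope context budget for a multi-turn agent session.
# The token counts are illustrative assumptions, not measured values.
SYSTEM_AND_SCHEMAS = 800   # system prompt + tool schemas
PER_ROUND = 1500           # one user/assistant exchange + one tool result

def context_after(rounds: int) -> int:
    """Approximate tokens in context after `rounds` tool-call rounds."""
    return SYSTEM_AND_SCHEMAS + rounds * PER_ROUND
```

Under these assumptions a 4K window overflows after only two or three rounds, while a 32K window accommodates a 20-round session with room to spare.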
Results (200-example held-out eval)
| Metric | smol-tools-4b (4K) | smol-tools-4b-32k | Delta |
|---|---|---|---|
| Tool Selection F1 | 0.955 | 0.940 | -1.5 pts |
| Tool Precision | 0.955 | 0.940 | -1.5 pts |
| Tool Recall | 0.980 | 0.965 | -1.5 pts |
| JSON Validity | 100% | 100% | — |
| Argument Correctness | 100% | 100% | — |
| No-Tool Accuracy | 100% | 100% | — |
| Max Context | 4,096 | 32,768 | 8x |
Only a 1.5-point F1 drop compared to the 4K model, while supporting 8x the context length. JSON validity and argument correctness remain perfect.
Per-Scenario Breakdown
| Scenario | F1 | Count | Description |
|---|---|---|---|
| multi_tool_parallel | 1.000 | 18 | Multiple independent tool calls |
| no_tool_needed | 1.000 | 18 | Questions answerable without tools |
| multi_tool_sequential | 0.972 | 36 | Chained tool calls with dependencies |
| error_recovery | 0.944 | 18 | Handling malformed inputs or missing data |
| single_tool | 0.943 | 53 | One tool call needed |
| reasoning_heavy | 0.914 | 35 | Complex reasoning before tool selection |
| complex_multi_step | 0.818 | 22 | Multi-step workflows with planning |
Capabilities
- 32K context window — handles extended multi-turn agent conversations with accumulated tool results
- Tool selection: Picks the right tool(s) from a provided set with 94.0% F1
- Structured output: Produces valid `<tool_call>{"name": "...", "arguments": {...}}</tool_call>` JSON — 100% validity
- Tool refusal: Correctly answers directly when no tool is needed — 100% accuracy
- Multi-tool: Handles parallel and sequential multi-tool scenarios with high accuracy
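Because the output format is fixed, callers can extract calls with a small parser. A minimal sketch (the regex and helper name here are illustrative, not part of the repo):

```python
import json
import re

# Matches the model's <tool_call>...</tool_call> output format.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)

def parse_tool_calls(text: str) -> list[dict]:
    """Return each tool call as a dict with 'name' and 'arguments' keys."""
    return [json.loads(payload) for payload in TOOL_CALL_RE.findall(text)]

reply = '<tool_call>{"name": "web_search", "arguments": {"query": "SpaceX"}}</tool_call>'
calls = parse_tool_calls(reply)
```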
Available Tools (training set)
The model was trained with these 15 tools but generalizes to new tool schemas provided at inference:
web_search, get_webpage, execute_python, read_file, write_file, list_directory, send_email, get_current_datetime, calculate, translate, get_weather, create_calendar_event, database_query, http_request, shell_command
Quick Start
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "enfuse/smol-tools-4b-32k",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("enfuse/smol-tools-4b-32k", trust_remote_code=True)

tools = [
    {"type": "function", "function": {
        "name": "web_search",
        "description": "Search the web for information",
        "parameters": {"type": "object", "properties": {
            "query": {"type": "string"}
        }, "required": ["query"]}
    }}
]

messages = [
    {"role": "system", "content": "You are a helpful assistant with access to tools."},
    {"role": "user", "content": "What's the latest news about SpaceX?"},
]

prompt = tokenizer.apply_chat_template(messages, tools=tools, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=512, temperature=0.1, do_sample=True)
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=False))
```
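To continue the conversation, any tool call in the generated text has to be executed and its result appended before the next `generate()` call. A hedged sketch of that loop step (the registry and helper function are illustrative, not part of the repo):

```python
import json
import re

def execute_and_extend(messages, assistant_text, registry):
    """Append the assistant reply, run each <tool_call> payload through
    `registry` (tool name -> callable), and append results as 'tool' messages."""
    messages.append({"role": "assistant", "content": assistant_text})
    for payload in re.findall(r"<tool_call>(.*?)</tool_call>", assistant_text, re.DOTALL):
        call = json.loads(payload)
        result = registry[call["name"]](**call["arguments"])
        messages.append({"role": "tool", "content": json.dumps(result)})
    return messages
```

Re-applying the chat template to the extended messages list produces the prompt for the next round.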
With vLLM (faster)
```python
from vllm import LLM, SamplingParams

llm = LLM(model="enfuse/smol-tools-4b-32k", dtype="bfloat16", max_model_len=32768, enforce_eager=True)
sampling = SamplingParams(max_tokens=2048, temperature=0.1, stop=["<|im_end|>"])
outputs = llm.generate([prompt], sampling)
```
With llama.cpp (GGUF, edge devices)
```shell
# Download the Q4_K_M GGUF (2.9 GB) for edge deployment
huggingface-cli download enfuse/smol-tools-4b-32k smol-tools-4b-32k-q4_k_m.gguf --local-dir .

# Run with llama-server (OpenAI-compatible API)
llama-server -m smol-tools-4b-32k-q4_k_m.gguf -c 4096 -ngl 99 --port 8080

# Or with llama-cli for one-shot inference
llama-cli -m smol-tools-4b-32k-q4_k_m.gguf -c 4096 -ngl 99 -p "<your prompt>"
```

Or with llama-cpp-python:

```python
from llama_cpp import Llama

llm = Llama(model_path="smol-tools-4b-32k-q4_k_m.gguf", n_ctx=4096, n_gpu_layers=-1)
output = llm(prompt, max_tokens=2048, temperature=0.1, stop=["<|im_end|>"])
print(output["choices"][0]["text"])
```
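The llama-server command above exposes an OpenAI-compatible `/v1/chat/completions` endpoint. A minimal request payload looks like this (the `model` field is required by the schema, but llama-server serves whichever model it loaded):

```python
import json

# OpenAI-style chat payload for the llama-server started above.
payload = {
    "model": "smol-tools-4b-32k",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant with access to tools."},
        {"role": "user", "content": "What's the latest news about SpaceX?"},
    ],
    "temperature": 0.1,
    "max_tokens": 512,
}
body = json.dumps(payload)
# POST `body` to http://localhost:8080/v1/chat/completions with
# Content-Type: application/json (e.g. via curl or the openai client).
```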
Training Details
| Parameter | Value |
|---|---|
| Base model | Qwen3.5-4B-Claude-4.6-Opus-Reasoning-Distilled |
| Method | LoRA (rank 64, alpha 128) |
| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Training examples | 8,059 (6,855 short-context + 1,204 multi-turn long-context) |
| Epochs | 3 |
| Batch size | 1 (× 32 gradient accumulation = effective 32) |
| Learning rate | 1e-4 (cosine schedule) |
| Max sequence length | 32,768 |
| Eval loss | 0.140 |
| Train loss | 0.229 |
| Training time | ~29.4 hours on 1× NVIDIA H200 |
| Framework | TRL SFTTrainer + PEFT |
Data Pipeline
- Short-context data (6,855 examples): Quality-filtered synthetic tool-use conversations from smol-tools-4b training, covering all 7 scenario types at up to 4K tokens
- Multi-turn long-context data (1,204 examples): New synthetic conversations generated by Qwen3.5-27B teacher at 64K context, featuring 8-16 rounds of tool calls with realistic tool outputs (database results, file contents, web pages). Cleaned to remove malformed tool calls and ensure proper conversation endings
- Combined: 8,059 examples with a mix of short single-turn and long multi-turn conversations, filtered to ≤32K tokens
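The final ≤32K filter amounts to dropping any conversation whose rendered length exceeds the window. A sketch assuming a generic `count_tokens` callable (the helper names are ours, not from the pipeline):

```python
def filter_to_window(examples, count_tokens, max_tokens=32768):
    """Keep only conversations whose token count fits the context window."""
    return [ex for ex in examples if count_tokens(ex) <= max_tokens]

# Illustrative stand-in counter: whitespace tokens. A real pipeline would
# count tokens from the model tokenizer's chat-template rendering.
count_ws = lambda ex: len(ex.split())
kept = filter_to_window(["a " * 10, "b " * 40000], count_ws)
```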
smol-tools Family
All models share the same base architecture, tool schema, and output format. Choose based on your context length needs:
| Model | Context | Tool F1 | JSON Valid | No-Tool Acc | LoRA config | HF Repo |
|---|---|---|---|---|---|---|
| smol-tools-4b | 4K | 0.955 | 100% | 100% | Rank 32, α=64 | enfuse/smol-tools-4b |
| smol-tools-4b-16k | 16K | 0.948 | 100% | 100% | Rank 64, α=128 | enfuse/smol-tools-4b-16k |
| smol-tools-4b-32k | 32K | 0.940 | 100% | 100% | Rank 64, α=128 | this repo |
How to choose:
- 4K: Single-turn tool calls, short tool outputs — highest accuracy, lowest memory — also available in GGUF quantized formats
- 16K: Multi-turn conversations (5-10 rounds), moderate tool outputs — also available in GGUF quantized formats
- 32K (this model): Extended agent sessions (10-20 rounds), large tool outputs — also available in GGUF quantized formats
When to Use This Model
- You're building an agent that needs extended multi-turn tool conversations — research assistants, code generators, data analysis pipelines with many rounds
- Your tool outputs are very long — large database results, full source files, lengthy web scrapes that push context well beyond 16K
- You need a small, fast model that can handle extended agent sessions without losing tool-calling accuracy
- You want structured output you can trust at long context — 100% JSON validity even with 32K of accumulated context
When NOT to Use This Model
- If all your tool calls are single-turn with short outputs, use smol-tools-4b (4K) — it's slightly more accurate and uses less memory
- If your conversations stay under 16K tokens, use smol-tools-4b-16k — it's slightly more accurate and trains faster
- If you need GPT-4-level complex multi-step planning (our weakest category at F1=0.818), use a bigger model
Limitations
- complex_multi_step scenarios (F1=0.818) remain the weakest — the model sometimes struggles with multi-step planning involving 3+ chained tools
- Trained on synthetic data only — real-world tool-use patterns may differ
- The 32K training data was generated by a 27B teacher model; very long conversations (>24K tokens) may show quality degradation compared to shorter ones
- Inherits Qwen3.5-4B base model limitations (knowledge cutoff)
Quantization Results (200-example eval)
| Format | Size | Tool F1 | Precision | Recall | JSON | Args | No-Tool |
|---|---|---|---|---|---|---|---|
| BF16 | 9.7 GB | 0.940 | 0.940 | 0.965 | 100% | 100% | 100% |
| Q8_0 | 4.9 GB | 0.918 | 0.917 | 0.955 | 100% | 100% | 100% |
| Q4_K_M | 2.9 GB | 0.925 | 0.925 | 0.955 | 100% | 100% | 100% |
All formats preserve perfect JSON validity and argument correctness. The Q4_K_M quantization (3.4x smaller) retains 98% of the BF16 model's tool-calling accuracy.
Hardware
- Training: 1× NVIDIA H200 NVL (141 GB HBM3e), ~29.4 hours
- Inference (BF16, 32K context): Any GPU with ≥24 GB VRAM
- Inference (BF16, 16K context): Any GPU with ≥16 GB VRAM
- Inference (BF16, 4K context): Any GPU with ≥10 GB VRAM
- Inference (Q8_0 GGUF, 32K context): Any device with ≥12 GB RAM — Jetson Orin AGX, consumer GPUs
- Inference (Q8_0 GGUF, 4K context): Any device with ≥6 GB RAM
- Inference (Q4_K_M GGUF, 32K context): Any device with ≥8 GB RAM — Jetson Orin NX, phones
- Inference (Q4_K_M GGUF, 4K context): Any device with ≥4 GB RAM — Jetson Orin Nano, RPi 5
Attribution
- Base model: Qwen3.5-4B-Claude-4.6-Opus-Reasoning-Distilled by Jackrong
- Training framework: TRL + PEFT by HuggingFace
- Inference: vLLM