smol-tools-4b-32k — Long-Context Agentic Tool-Use Model
A 4B parameter model fine-tuned for reliable tool calling with 32K context support. Handles extended multi-turn tool-use conversations, long document analysis with tool calls, and complex multi-step agent workflows — all within a single context window 8x larger than the original smol-tools-4b.
Built on Qwen3.5-4B-Claude-4.6-Opus-Reasoning-Distilled, trained with LoRA on 8,059 examples including multi-turn tool-use conversations up to 32K tokens.
Architecture:
Qwen3_5ForCausalLM (text-only, 32 layers, hybrid attention: 24 linear-attention + 8 full-attention layers). Qwen3.5's efficient attention keeps long-context inference memory-friendly.
Need less context? See smol-tools-4b-16k (16K, also available in GGUF) or smol-tools-4b (4K, highest accuracy, GGUF also available).
Available Formats
| Format | File | Size | Tool F1 | Use Case |
|---|---|---|---|---|
| BF16 safetensors | model.safetensors | 9.7 GB | 0.940 | GPU inference with transformers / vLLM |
| Q8_0 GGUF | smol-tools-4b-32k-q8_0.gguf | 4.9 GB | 0.918 | Near-lossless: Jetson Orin NX/AGX, 8 GB+ GPUs |
| Q4_K_M GGUF | smol-tools-4b-32k-q4_k_m.gguf | 2.9 GB | 0.925 | Edge deployment: Jetson Orin Nano, phones, RPi 5 |
All formats maintain 100% JSON validity, 100% argument correctness, and 100% no-tool accuracy. GGUF files run with llama.cpp, ollama, or llama-cpp-python.
Why 32K Context?
The original smol-tools-4b (4K context) works well for single-turn tool calls. But real agent workflows often involve:
- Multi-turn conversations — 10-20 rounds of tool calls and results accumulating in context
- Long tool outputs — database queries returning hundreds of rows, full file contents, and lengthy web pages
- Complex planning — reasoning over many prior results to decide next steps
This model was specifically trained on multi-turn tool-use data (up to 32K tokens) so it maintains tool-calling accuracy across very long conversations, not just short single-turn queries. 32K tokens is enough for most real-world agent sessions.
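As a rough illustration of why the larger window matters (the per-round token counts below are assumptions for the sketch, not measurements from the training data):

```python
# Back-of-envelope context budget for a multi-turn agent session.
# The token counts are illustrative assumptions, not measured values.
SYSTEM_AND_SCHEMAS = 800   # system prompt + tool schemas
PER_ROUND = 1500           # one user/assistant exchange + one tool result

def context_after(rounds: int) -> int:
    """Approximate tokens in context after `rounds` tool-call rounds."""
    return SYSTEM_AND_SCHEMAS + rounds * PER_ROUND
```

Under these assumptions a 4K window overflows after only two or three rounds, while a 32K window accommodates a 20-round session with room to spare.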
Results (200-example held-out eval)
| Metric | smol-tools-4b (4K) | smol-tools-4b-32k | Delta |
|---|---|---|---|
| Tool Selection F1 | 0.955 | 0.940 | -1.5 pts |
| Tool Precision | 0.955 | 0.940 | -1.5 pts |
| Tool Recall | 0.980 | 0.965 | -1.5 pts |
| JSON Validity | 100% | 100% | — |
| Argument Correctness | 100% | 100% | — |
| No-Tool Accuracy | 100% | 100% | — |
| Max Context | 4,096 | 32,768 | 8x |
Only a 1.5-point F1 drop compared to the 4K model, while supporting 8x the context length. JSON validity and argument correctness remain perfect.
Per-Scenario Breakdown
| Scenario | F1 | Count | Description |
|---|---|---|---|
| multi_tool_parallel | 1.000 | 18 | Multiple independent tool calls |
| no_tool_needed | 1.000 | 18 | Questions answerable without tools |
| multi_tool_sequential | 0.972 | 36 | Chained tool calls with dependencies |
| error_recovery | 0.944 | 18 | Handling malformed inputs or missing data |
| single_tool | 0.943 | 53 | One tool call needed |
| reasoning_heavy | 0.914 | 35 | Complex reasoning before tool selection |
| complex_multi_step | 0.818 | 22 | Multi-step workflows with planning |
Capabilities
- 32K context window — handles extended multi-turn agent conversations with accumulated tool results
- Tool selection: Picks the right tool(s) from a provided set with 94.0% F1
- Structured output: Produces valid `<tool_call>{"name": "...", "arguments": {...}}</tool_call>` JSON — 100% validity
- Tool refusal: Correctly answers directly when no tool is needed — 100% accuracy
- Multi-tool: Handles parallel and sequential multi-tool scenarios with high accuracy
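Because the output format is fixed, callers can extract calls with a small parser. A minimal sketch (the regex and helper name here are illustrative, not part of the repo):

```python
import json
import re

# Matches the model's <tool_call>...</tool_call> output format.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)

def parse_tool_calls(text: str) -> list[dict]:
    """Return each tool call as a dict with 'name' and 'arguments' keys."""
    return [json.loads(payload) for payload in TOOL_CALL_RE.findall(text)]

reply = '<tool_call>{"name": "web_search", "arguments": {"query": "SpaceX"}}</tool_call>'
calls = parse_tool_calls(reply)
```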
Available Tools (training set)
The model was trained with these 15 tools but generalizes to new tool schemas provided at inference:
web_search, get_webpage, execute_python, read_file, write_file, list_directory, send_email, get_current_datetime, calculate, translate, get_weather, create_calendar_event, database_query, http_request, shell_command
Quick Start
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "enfuse/smol-tools-4b-32k",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("enfuse/smol-tools-4b-32k", trust_remote_code=True)

tools = [
    {"type": "function", "function": {
        "name": "web_search",
        "description": "Search the web for information",
        "parameters": {"type": "object", "properties": {
            "query": {"type": "string"}
        }, "required": ["query"]}
    }}
]

messages = [
    {"role": "system", "content": "You are a helpful assistant with access to tools."},
    {"role": "user", "content": "What's the latest news about SpaceX?"},
]

prompt = tokenizer.apply_chat_template(messages, tools=tools, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=512, temperature=0.1, do_sample=True)
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=False))
```
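To continue the conversation, any tool call in the generated text has to be executed and its result appended before the next `generate()` call. A hedged sketch of that loop step (the registry and helper function are illustrative, not part of the repo):

```python
import json
import re

def execute_and_extend(messages, assistant_text, registry):
    """Append the assistant reply, run each <tool_call> payload through
    `registry` (tool name -> callable), and append results as 'tool' messages."""
    messages.append({"role": "assistant", "content": assistant_text})
    for payload in re.findall(r"<tool_call>(.*?)</tool_call>", assistant_text, re.DOTALL):
        call = json.loads(payload)
        result = registry[call["name"]](**call["arguments"])
        messages.append({"role": "tool", "content": json.dumps(result)})
    return messages
```

Re-applying the chat template to the extended messages list produces the prompt for the next round.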
With vLLM (faster)
```python
from vllm import LLM, SamplingParams

llm = LLM(model="enfuse/smol-tools-4b-32k", dtype="bfloat16", max_model_len=32768, enforce_eager=True)
sampling = SamplingParams(max_tokens=2048, temperature=0.1, stop=["<|im_end|>"])
outputs = llm.generate([prompt], sampling)
```
With llama.cpp (GGUF, edge devices)
```shell
# Download the Q4_K_M GGUF (2.9 GB) for edge deployment
huggingface-cli download enfuse/smol-tools-4b-32k smol-tools-4b-32k-q4_k_m.gguf --local-dir .

# Run with llama-server (OpenAI-compatible API)
llama-server -m smol-tools-4b-32k-q4_k_m.gguf -c 4096 -ngl 99 --port 8080

# Or with llama-cli for one-shot inference
llama-cli -m smol-tools-4b-32k-q4_k_m.gguf -c 4096 -ngl 99 -p "<your prompt>"
```

Or with llama-cpp-python:

```python
from llama_cpp import Llama

llm = Llama(model_path="smol-tools-4b-32k-q4_k_m.gguf", n_ctx=4096, n_gpu_layers=-1)
output = llm(prompt, max_tokens=2048, temperature=0.1, stop=["<|im_end|>"])
print(output["choices"][0]["text"])
```
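The llama-server command above exposes an OpenAI-compatible `/v1/chat/completions` endpoint. A minimal request payload looks like this (the `model` field is required by the schema, but llama-server serves whichever model it loaded):

```python
import json

# OpenAI-style chat payload for the llama-server started above.
payload = {
    "model": "smol-tools-4b-32k",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant with access to tools."},
        {"role": "user", "content": "What's the latest news about SpaceX?"},
    ],
    "temperature": 0.1,
    "max_tokens": 512,
}
body = json.dumps(payload)
# POST `body` to http://localhost:8080/v1/chat/completions with
# Content-Type: application/json (e.g. via curl or the openai client).
```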
Training Details
| Parameter | Value |
|---|---|
| Base model | Qwen3.5-4B-Claude-4.6-Opus-Reasoning-Distilled |
| Method | LoRA (rank 64, alpha 128) |
| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Training examples | 8,059 (6,855 short-context + 1,204 multi-turn long-context) |
| Epochs | 3 |
| Batch size | 1 (× 32 gradient accumulation = effective 32) |
| Learning rate | 1e-4 (cosine schedule) |
| Max sequence length | 32,768 |
| Eval loss | 0.140 |
| Train loss | 0.229 |
| Training time | ~29.4 hours on 1× NVIDIA H200 |
| Framework | TRL SFTTrainer + PEFT |
Data Pipeline
- Short-context data (6,855 examples): Quality-filtered synthetic tool-use conversations from smol-tools-4b training, covering all 7 scenario types at up to 4K tokens
- Multi-turn long-context data (1,204 examples): New synthetic conversations generated by Qwen3.5-27B teacher at 64K context, featuring 8-16 rounds of tool calls with realistic tool outputs (database results, file contents, web pages). Cleaned to remove malformed tool calls and ensure proper conversation endings
- Combined: 8,059 examples with a mix of short single-turn and long multi-turn conversations, filtered to ≤32K tokens
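The final ≤32K filter amounts to dropping any conversation whose rendered length exceeds the window. A sketch assuming a generic `count_tokens` callable (the helper names are ours, not from the pipeline):

```python
def filter_to_window(examples, count_tokens, max_tokens=32768):
    """Keep only conversations whose token count fits the context window."""
    return [ex for ex in examples if count_tokens(ex) <= max_tokens]

# Illustrative stand-in counter: whitespace tokens. A real pipeline would
# count tokens from the model tokenizer's chat-template rendering.
count_ws = lambda ex: len(ex.split())
kept = filter_to_window(["a " * 10, "b " * 40000], count_ws)
```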
smol-tools Family
All models share the same base architecture, tool schema, and output format. Choose based on your context length needs:
| Model | Context | Tool F1 | JSON Valid | No-Tool Acc | LoRA config | HF Repo |
|---|---|---|---|---|---|---|
| smol-tools-4b | 4K | 0.955 | 100% | 100% | Rank 32, α=64 | enfuse/smol-tools-4b |
| smol-tools-4b-16k | 16K | 0.948 | 100% | 100% | Rank 64, α=128 | enfuse/smol-tools-4b-16k |
| smol-tools-4b-32k | 32K | 0.940 | 100% | 100% | Rank 64, α=128 | this repo |
How to choose:
- 4K: Single-turn tool calls, short tool outputs — highest accuracy, lowest memory — also available in GGUF quantized formats
- 16K: Multi-turn conversations (5-10 rounds), moderate tool outputs — also available in GGUF quantized formats
- 32K (this model): Extended agent sessions (10-20 rounds), large tool outputs — also available in GGUF quantized formats
When to Use This Model
- You're building an agent that needs extended multi-turn tool conversations — research assistants, code generators, data analysis pipelines with many rounds
- Your tool outputs are very long — large database results, full source files, lengthy web scrapes that push context well beyond 16K
- You need a small, fast model that can handle extended agent sessions without losing tool-calling accuracy
- You want structured output you can trust at long context — 100% JSON validity even with 32K of accumulated context
When NOT to Use This Model
- If all your tool calls are single-turn with short outputs, use smol-tools-4b (4K) — it's slightly more accurate and uses less memory
- If your conversations stay under 16K tokens, use smol-tools-4b-16k — it's slightly more accurate and trains faster
- If you need GPT-4-level complex multi-step planning (our weakest category at F1=0.818), use a bigger model
Limitations
- complex_multi_step scenarios (F1=0.818) remain the weakest — the model sometimes struggles with multi-step planning involving 3+ chained tools
- Trained on synthetic data only — real-world tool-use patterns may differ
- The 32K training data was generated by a 27B teacher model; very long conversations (>24K tokens) may show quality degradation compared to shorter ones
- Inherits Qwen3.5-4B base model limitations (knowledge cutoff)
Quantization Results (200-example eval)
| Format | Size | Tool F1 | Precision | Recall | JSON | Args | No-Tool |
|---|---|---|---|---|---|---|---|
| BF16 | 9.7 GB | 0.940 | 0.940 | 0.965 | 100% | 100% | 100% |
| Q8_0 | 4.9 GB | 0.918 | 0.917 | 0.955 | 100% | 100% | 100% |
| Q4_K_M | 2.9 GB | 0.925 | 0.925 | 0.955 | 100% | 100% | 100% |
All formats preserve perfect JSON validity and argument correctness. The Q4_K_M quantization (3.4x smaller) retains 98% of the BF16 model's tool-calling accuracy.
Hardware
- Training: 1× NVIDIA H200 NVL (141 GB HBM3e), ~29.4 hours
- Inference (BF16, 32K context): Any GPU with ≥24 GB VRAM
- Inference (BF16, 16K context): Any GPU with ≥16 GB VRAM
- Inference (BF16, 4K context): Any GPU with ≥10 GB VRAM
- Inference (Q8_0 GGUF, 32K context): Any device with ≥12 GB RAM — Jetson Orin AGX, consumer GPUs
- Inference (Q8_0 GGUF, 4K context): Any device with ≥6 GB RAM
- Inference (Q4_K_M GGUF, 32K context): Any device with ≥8 GB RAM — Jetson Orin NX, phones
- Inference (Q4_K_M GGUF, 4K context): Any device with ≥4 GB RAM — Jetson Orin Nano, RPi 5
Attribution
- Base model: Qwen3.5-4B-Claude-4.6-Opus-Reasoning-Distilled by Jackrong
- Training framework: TRL + PEFT by HuggingFace
- Inference: vLLM