# smol-tools-4b-32k — Long-Context Agentic Tool-Use Model

A 4B parameter model fine-tuned for reliable tool calling with 32K context support. Handles extended multi-turn tool-use conversations, long document analysis with tool calls, and complex multi-step agent workflows — all within a single context window 8x larger than the original smol-tools-4b.

Built on Qwen3.5-4B-Claude-4.6-Opus-Reasoning-Distilled, trained with LoRA on 8,059 examples including multi-turn tool-use conversations up to 32K tokens.

Architecture: Qwen3_5ForCausalLM (text-only, 32 layers, hybrid attention — 24 linear + 8 full-attention). Qwen3.5's efficient attention makes long-context inference memory-friendly.

Need less context? See smol-tools-4b-16k (16K, also available in GGUF) or smol-tools-4b (4K, highest accuracy, GGUF also available).

## Available Formats

| Format | File | Size | Tool F1 | Use Case |
|---|---|---|---|---|
| BF16 safetensors | model.safetensors | 9.7 GB | 0.940 | GPU inference with transformers / vLLM |
| Q8_0 GGUF | smol-tools-4b-32k-q8_0.gguf | 4.9 GB | 0.918 | Near-lossless — Jetson Orin NX/AGX, 8 GB+ GPUs |
| Q4_K_M GGUF | smol-tools-4b-32k-q4_k_m.gguf | 2.9 GB | 0.925 | Edge deployment — Jetson Orin Nano, phones, RPi 5 |

All formats maintain 100% JSON validity, 100% argument correctness, and 100% no-tool accuracy. GGUF files run with llama.cpp, ollama, or llama-cpp-python.

## Why 32K Context?

The original smol-tools-4b (4K context) works well for single-turn tool calls. But real agent workflows often involve:

  • Multi-turn conversations — 10-20 rounds of tool calls and results accumulating in context
  • Long tool outputs — database queries returning hundreds of rows, full file contents, and lengthy web pages
  • Complex planning — reasoning over many prior results to decide next steps

This model was specifically trained on multi-turn tool-use data (up to 32K tokens) so it maintains tool-calling accuracy across very long conversations, not just short single-turn queries. 32K tokens is enough for most real-world agent sessions.
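In message terms, each of those rounds appends the model's tool call and its (possibly very long) result back into the shared history. A minimal sketch of that accumulation, with the actual model call and tool execution stubbed out (`run_tool` would be your own executor, not part of this repo):

```python
import json

def append_tool_round(messages, tool_call, result):
    """Append one round (assistant tool call + tool result) to the history.

    `tool_call` is a dict like {"name": ..., "arguments": {...}} parsed from
    the model's <tool_call> block; `result` is whatever run_tool returned.
    In real sessions results can be hundreds of rows or a full file, which is
    exactly what the 32K window is for.
    """
    messages.append({
        "role": "assistant",
        "content": f"<tool_call>{json.dumps(tool_call)}</tool_call>",
    })
    messages.append({"role": "tool", "content": json.dumps(result)})
    return messages

history = [{"role": "user", "content": "Summarize Q3 sales."}]
append_tool_round(
    history,
    {"name": "database_query", "arguments": {"sql": "SELECT 1"}},
    {"rows": [[1]]},  # stand-in for run_tool(...) output
)
```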

## Results (200-example held-out eval)

| Metric | smol-tools-4b (4K) | smol-tools-4b-32k | Delta |
|---|---|---|---|
| Tool Selection F1 | 0.955 | 0.940 | -1.5 pts |
| Tool Precision | 0.955 | 0.940 | -1.5 pts |
| Tool Recall | 0.980 | 0.965 | -1.5 pts |
| JSON Validity | 100% | 100% | — |
| Argument Correctness | 100% | 100% | — |
| No-Tool Accuracy | 100% | 100% | — |
| Max Context | 4,096 | 32,768 | 8x |

Only a 1.5-point F1 drop compared to the 4K model, while supporting 8x the context length. Perfect JSON validity and argument correctness are preserved.

## Per-Scenario Breakdown

| Scenario | F1 | Count | Description |
|---|---|---|---|
| multi_tool_parallel | 1.000 | 18 | Multiple independent tool calls |
| no_tool_needed | 1.000 | 18 | Questions answerable without tools |
| multi_tool_sequential | 0.972 | 36 | Chained tool calls with dependencies |
| error_recovery | 0.944 | 18 | Handling malformed inputs or missing data |
| single_tool | 0.943 | 53 | One tool call needed |
| reasoning_heavy | 0.914 | 35 | Complex reasoning before tool selection |
| complex_multi_step | 0.818 | 22 | Multi-step workflows with planning |
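As a sanity check (our arithmetic, not a claim from the eval itself), the headline 0.940 F1 is consistent with the count-weighted mean of the per-scenario scores:

```python
# (F1, example count) per scenario, copied from the table above
scores = {
    "multi_tool_parallel": (1.000, 18),
    "no_tool_needed": (1.000, 18),
    "multi_tool_sequential": (0.972, 36),
    "error_recovery": (0.944, 18),
    "single_tool": (0.943, 53),
    "reasoning_heavy": (0.914, 35),
    "complex_multi_step": (0.818, 22),
}
total = sum(n for _, n in scores.values())  # 200 held-out examples
weighted_f1 = sum(f1 * n for f1, n in scores.values()) / total
```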

## Capabilities

  • 32K context window — handles extended multi-turn agent conversations with accumulated tool results
  • Tool selection: Picks the right tool(s) from a provided set with 94.0% F1
  • Structured output: Produces valid <tool_call>{"name": "...", "arguments": {...}}</tool_call> JSON — 100% validity
  • Tool refusal: Correctly answers directly when no tool is needed — 100% accuracy
  • Multi-tool: Handles parallel and sequential multi-tool scenarios with high accuracy

## Available Tools (training set)

The model was trained with these 15 tools but generalizes to new tool schemas provided at inference:

`web_search`, `get_webpage`, `execute_python`, `read_file`, `write_file`, `list_directory`, `send_email`, `get_current_datetime`, `calculate`, `translate`, `get_weather`, `create_calendar_event`, `database_query`, `http_request`, `shell_command`
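The model only emits tool names and arguments; mapping them to real functions is up to your runtime. A minimal dispatch sketch (the handlers below are toy stand-ins for illustration, not shipped with the model):

```python
import datetime

# Toy implementations for two of the trained tool names.
TOOLS = {
    # Demo only: restricted eval of arithmetic; never eval untrusted input in production.
    "calculate": lambda expression: eval(expression, {"__builtins__": {}}),
    "get_current_datetime": lambda: datetime.datetime.now().isoformat(),
}

def dispatch(name, arguments):
    """Route a parsed tool call to its implementation."""
    if name not in TOOLS:
        raise KeyError(f"model requested unknown tool: {name}")
    return TOOLS[name](**arguments)
```

Since the model generalizes to new tool schemas, registering a new tool is just adding a schema to the `tools` list at inference time and an entry to a registry like this.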

## Quick Start

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "enfuse/smol-tools-4b-32k",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("enfuse/smol-tools-4b-32k", trust_remote_code=True)

tools = [
    {"type": "function", "function": {
        "name": "web_search",
        "description": "Search the web for information",
        "parameters": {"type": "object", "properties": {
            "query": {"type": "string"}
        }, "required": ["query"]}
    }}
]

messages = [
    {"role": "system", "content": "You are a helpful assistant with access to tools."},
    {"role": "user", "content": "What's the latest news about SpaceX?"},
]

prompt = tokenizer.apply_chat_template(messages, tools=tools, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=512, temperature=0.1, do_sample=True)
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=False))
```
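The generated text carries the call inside a `<tool_call>...</tool_call>` block; a small regex-based parser for that format (assuming one JSON object per block, which matches the output format described above):

```python
import json
import re

TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)

def parse_tool_calls(text):
    """Extract {"name": ..., "arguments": {...}} dicts from generated text."""
    calls = []
    for match in TOOL_CALL_RE.finditer(text):
        try:
            calls.append(json.loads(match.group(1)))
        except json.JSONDecodeError:
            continue  # the eval reports 100% JSON validity, but guard anyway
    return calls

sample = '<tool_call>{"name": "web_search", "arguments": {"query": "SpaceX news"}}</tool_call>'
calls = parse_tool_calls(sample)
```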

### With vLLM (faster)

```python
from vllm import LLM, SamplingParams

llm = LLM(model="enfuse/smol-tools-4b-32k", dtype="bfloat16", max_model_len=32768, enforce_eager=True)
sampling = SamplingParams(max_tokens=2048, temperature=0.1, stop=["<|im_end|>"])
# `prompt` is the chat-template string built in the Quick Start example above
outputs = llm.generate([prompt], sampling)
```

### With llama.cpp (GGUF, edge devices)

```shell
# Download the Q4_K_M GGUF (2.9 GB) for edge deployment
huggingface-cli download enfuse/smol-tools-4b-32k smol-tools-4b-32k-q4_k_m.gguf --local-dir .

# Run with llama-server (OpenAI-compatible API); raise -c toward 32768 if memory allows
llama-server -m smol-tools-4b-32k-q4_k_m.gguf -c 4096 -ngl 99 --port 8080

# Or with llama-cli for one-shot inference
llama-cli -m smol-tools-4b-32k-q4_k_m.gguf -c 4096 -ngl 99 -p "<your prompt>"
```

Or with llama-cpp-python:

```python
from llama_cpp import Llama

llm = Llama(model_path="smol-tools-4b-32k-q4_k_m.gguf", n_ctx=4096, n_gpu_layers=-1)
output = llm(prompt, max_tokens=2048, temperature=0.1, stop=["<|im_end|>"])
print(output["choices"][0]["text"])
```
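Since llama-server exposes an OpenAI-compatible `/v1/chat/completions` endpoint, you can also drive it with a plain HTTP request. A sketch of the request body (the port matches the server command above; depending on your llama.cpp version and flags, tool schemas may need to be embedded in the system prompt instead of the `tools` field):

```python
import json

payload = {
    "model": "smol-tools-4b-32k",  # informational for llama-server
    "messages": [
        {"role": "system", "content": "You are a helpful assistant with access to tools."},
        {"role": "user", "content": "What's the weather in Berlin?"},
    ],
    "tools": [{
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }],
    "temperature": 0.1,
    "max_tokens": 512,
}

body = json.dumps(payload)
# POST `body` to http://localhost:8080/v1/chat/completions
# with header Content-Type: application/json (e.g. via curl or requests).
```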

## Training Details

| Parameter | Value |
|---|---|
| Base model | Qwen3.5-4B-Claude-4.6-Opus-Reasoning-Distilled |
| Method | LoRA (rank 64, alpha 128) |
| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Training examples | 8,059 (6,855 short-context + 1,204 multi-turn long-context) |
| Epochs | 3 |
| Batch size | 1 (× 32 gradient accumulation = effective 32) |
| Learning rate | 1e-4 (cosine schedule) |
| Max sequence length | 32,768 |
| Eval loss | 0.140 |
| Train loss | 0.229 |
| Training time | ~29.4 hours on 1× NVIDIA H200 |
| Framework | TRL SFTTrainer + PEFT |

## Data Pipeline

  1. Short-context data (6,855 examples): Quality-filtered synthetic tool-use conversations from smol-tools-4b training, covering all 7 scenario types at up to 4K tokens
  2. Multi-turn long-context data (1,204 examples): New synthetic conversations generated by Qwen3.5-27B teacher at 64K context, featuring 8-16 rounds of tool calls with realistic tool outputs (database results, file contents, web pages). Cleaned to remove malformed tool calls and ensure proper conversation endings
  3. Combined: 8,059 examples with a mix of short single-turn and long multi-turn conversations, filtered to ≤32K tokens
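The length filter in step 3 amounts to dropping any example whose rendered conversation exceeds the training window. A sketch using a whitespace word count as a crude tokenizer proxy (the actual pipeline would count tokens with the model's tokenizer):

```python
MAX_TOKENS = 32_768

def approx_tokens(example):
    """Crude proxy: whitespace word count. Swap in the real tokenizer for accuracy."""
    return sum(len(turn["content"].split()) for turn in example["messages"])

def filter_by_length(examples, max_tokens=MAX_TOKENS):
    """Keep only examples that fit in the training context window."""
    return [ex for ex in examples if approx_tokens(ex) <= max_tokens]

data = [
    {"messages": [{"role": "user", "content": "short question"}]},
    {"messages": [{"role": "tool", "content": "row " * 40_000}]},  # oversized tool output
]
kept = filter_by_length(data)
```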

## smol-tools Family

All models share the same base architecture, tool schema, and output format. Choose based on your context length needs:

| Model | Context | Tool F1 | JSON Valid | No-Tool Acc | LoRA Config | HF Repo |
|---|---|---|---|---|---|---|
| smol-tools-4b | 4K | 0.955 | 100% | 100% | rank 32, α=64 | enfuse/smol-tools-4b |
| smol-tools-4b-16k | 16K | 0.948 | 100% | 100% | rank 64, α=128 | enfuse/smol-tools-4b-16k |
| smol-tools-4b-32k | 32K | 0.940 | 100% | 100% | rank 64, α=128 | this repo |

How to choose:

  • 4K: Single-turn tool calls, short tool outputs — highest accuracy, lowest memory
  • 16K: Multi-turn conversations (5-10 rounds), moderate tool outputs
  • 32K (this model): Extended agent sessions (10-20 rounds), large tool outputs

All three models are also available in GGUF quantized formats.
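That decision rule can be written down directly (repo names from the family table; estimating your session's token budget is up to you):

```python
def pick_smol_tools(expected_context_tokens):
    """Pick the smallest family member whose window fits the session."""
    if expected_context_tokens <= 4_096:
        return "enfuse/smol-tools-4b"        # highest accuracy, lowest memory
    if expected_context_tokens <= 16_384:
        return "enfuse/smol-tools-4b-16k"
    if expected_context_tokens <= 32_768:
        return "enfuse/smol-tools-4b-32k"    # this model
    raise ValueError("session exceeds 32K; truncate history or use a larger model")
```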

## When to Use This Model

  • You're building an agent that needs extended multi-turn tool conversations — research assistants, code generators, data analysis pipelines with many rounds
  • Your tool outputs are very long — large database results, full source files, lengthy web scrapes that push context well beyond 16K
  • You need a small, fast model that can handle extended agent sessions without losing tool-calling accuracy
  • You want structured output you can trust at long context — 100% JSON validity even with 32K of accumulated context

## When NOT to Use This Model

  • If all your tool calls are single-turn with short outputs, use smol-tools-4b (4K) — it's slightly more accurate and uses less memory
  • If your conversations stay under 16K tokens, use smol-tools-4b-16k — it's slightly more accurate (0.948 vs 0.940 F1)
  • If you need GPT-4-level complex multi-step planning (our weakest category at F1=0.818), use a bigger model

## Limitations

  • complex_multi_step scenarios (F1=0.818) remain the weakest — the model sometimes struggles with multi-step planning involving 3+ chained tools
  • Trained on synthetic data only — real-world tool-use patterns may differ
  • The 32K training data was generated by a 27B teacher model; very long conversations (>24K tokens) may show quality degradation compared to shorter ones
  • Inherits Qwen3.5-4B base model limitations (knowledge cutoff)

## Quantization Results (200-example eval)

| Format | Size | Tool F1 | Precision | Recall | JSON | Args | No-Tool |
|---|---|---|---|---|---|---|---|
| BF16 | 9.7 GB | 0.940 | 0.940 | 0.965 | 100% | 100% | 100% |
| Q8_0 | 4.9 GB | 0.918 | 0.917 | 0.955 | 100% | 100% | 100% |
| Q4_K_M | 2.9 GB | 0.925 | 0.925 | 0.955 | 100% | 100% | 100% |

All formats preserve perfect JSON validity and argument correctness. The Q4_K_M quantization (3.3x smaller than BF16) retains over 98% of the BF16 model's tool-calling F1.
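The retention figure follows directly from the table:

```python
# F1 scores from the quantization table above
bf16_f1, q4_f1 = 0.940, 0.925
retention = q4_f1 / bf16_f1  # fraction of BF16 tool-calling F1 kept by Q4_K_M
```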

## Hardware

  • Training: 1× NVIDIA H200 NVL (141 GB HBM3e), ~29.4 hours
  • Inference (BF16, 32K context): Any GPU with ≥24 GB VRAM
  • Inference (BF16, 16K context): Any GPU with ≥16 GB VRAM
  • Inference (BF16, 4K context): Any GPU with ≥10 GB VRAM
  • Inference (Q8_0 GGUF, 32K context): Any device with ≥12 GB RAM — Jetson Orin AGX, consumer GPUs
  • Inference (Q8_0 GGUF, 4K context): Any device with ≥6 GB RAM
  • Inference (Q4_K_M GGUF, 32K context): Any device with ≥8 GB RAM — Jetson Orin NX, phones
  • Inference (Q4_K_M GGUF, 4K context): Any device with ≥4 GB RAM — Jetson Orin Nano, RPi 5
