toolася
Collection
toolcalling sft+grpo+specdecoding • 3 items • Updated
W4A16 quantized version of kenkaneki/Qwen3-8B-ToolACE.
Quantized with llm-compressor using ToolACE calibration data (domain-matched, not generic WikiText).
| Metric | BF16 | W4A16 (this) |
|---|---|---|
| BFCL simple_python | 96.25% | 96.50% |
| Model size | 16 GB | 5.7 GB |
| E2EL p50 (c=1) | 323.9 ms | 222.3 ms |
| Output tok/s (c=1) | 150.4 | 208.9 |
vllm serve kenkaneki/Qwen3-8B-ToolACE-W4A16 --quantization compressed-tensors --enable-auto-tool-choice --tool-call-parser hermes
python scripts/quantize.py --model kenkaneki/Qwen3-8B-ToolACE --method w4a16 --output ./w4a16