# acervo-extractor-qwen3.5-9b – GGUF Q4_K_M (4.7 GB)
Made autonomously using NEO, your autonomous AI Agent
GGUF-quantized version of SandyVeliz/acervo-extractor-qwen3.5-9b, a 9B document-extraction model fine-tuned on structured data parsing tasks (invoices, contracts, financial reports). Quantized to Q4_K_M with llama.cpp: it runs in 8 GB of RAM with only a +6% perplexity increase and 12% faster inference than the float16 original.
## Performance at a Glance
| Variant | File Size | Peak RAM | Speed | Perplexity Δ |
|---|---|---|---|---|
| float16 (original) | ~18 GB | 20 GB | 42.7 tok/s | baseline |
| Q4_K_M (this repo) | ~4.7 GB | 5.7 GB | 47.8 tok/s (+12%) | +6% |
| Q8_0 | ~9.5 GB | 10.7 GB | 45.3 tok/s (+6%) | +1% |
## Quality vs Speed Tradeoff
Each bubble represents a quantization tier. Bubble size = file size on disk. The ideal region is bottom-right (low perplexity + high speed). Q4_K_M sits at the sweet spot: a significant size reduction with minimal quality loss.
## Memory Requirements
Q4_K_M is the recommended tier: it fits comfortably in 8 GB of RAM and is the highest-quality format published in this repo that does so. Q8_0 needs roughly 11 GB at peak, and float16 about 20 GB.
## Pipeline Architecture
The quantization pipeline:
- `quantize.py` – downloads the base model from Hugging Face, builds llama.cpp, and converts to GGUF
- `benchmark.py` – measures perplexity, tokens/sec, and per-token latency stats
- `memory_estimator.py` – predicts peak RAM/VRAM for any (model_size, quant_type) pair
- `compare.py` – benchmarks multiple models side-by-side in one run
- `scripts/demo.py` – orchestrates everything end-to-end
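As a sketch of the per-token latency stats that `benchmark.py` reports (mean and P95): the exact percentile method the script uses is an assumption here; this uses a simple nearest-rank P95.

```python
import statistics

def latency_stats(latencies_ms):
    """Summarize per-token latencies: mean and 95th-percentile (P95).

    Nearest-rank P95 is an assumption; the real benchmark.py may
    interpolate or use a different estimator.
    """
    ordered = sorted(latencies_ms)
    mean = statistics.fmean(ordered)
    # Nearest-rank P95: the sample below which ~95% of values fall.
    p95 = ordered[min(len(ordered) - 1, int(0.95 * len(ordered)))]
    return mean, p95

mean, p95 = latency_stats([20.1, 21.0, 20.5, 22.3, 26.9, 20.7])
```

On real runs the input list would hold one timing per generated token, so P95 captures the occasional slow token that the mean hides.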
## How to Use
### With llama-cpp-python
```python
from llama_cpp import Llama

# Load the quantized model; tune n_ctx and n_threads for your hardware.
llm = Llama(
    model_path="acervo-extractor-qwen3.5-9b-Q4_K_M.gguf",
    n_ctx=2048,
    n_threads=8,
)

# Low temperature keeps extraction output near-deterministic.
output = llm(
    "Extract the key financial metrics from the following document:\n\n[document text here]",
    max_tokens=256,
    temperature=0.1,
)
print(output["choices"][0]["text"])
```
### With llama.cpp CLI
```bash
./llama-cli -m acervo-extractor-qwen3.5-9b-Q4_K_M.gguf \
  -p "Parse the following invoice and return structured JSON:" \
  -n 256 --temp 0.1
```
### With Ollama
```bash
ollama run hf.co/[your-username]/acervo-extractor-qwen3.5-9b-gguf
```
## Benchmark Results
Results below come from a mock/dry-run benchmark (synthetic data matching llama.cpp GGUF profiles); real measurements on the downloaded model will vary slightly.
### Perplexity
| Variant | Perplexity | Δ vs float16 | % Change |
|---|---|---|---|
| float16 | 18.4321 | baseline | – |
| Q4_K_M | 19.5380 | +1.1059 | +6.00% |
| Q8_0 | 18.6164 | +0.1843 | +1.00% |
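The Δ and % Change columns follow directly from the perplexity values; for example, for Q4_K_M:

```python
# Perplexity values from the table above.
float16_ppl = 18.4321
q4_ppl = 19.5380

delta = q4_ppl - float16_ppl         # absolute increase: +1.1059
pct = 100 * delta / float16_ppl      # relative increase: +6.00%
```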
### Speed & Latency
| Variant | Tokens/sec | Mean latency | P95 latency | Speedup | Size |
|---|---|---|---|---|---|
| float16 | 42.7 | 23.42 ms | 30.15 ms | baseline | 100% |
| Q4_K_M | 47.8 | 20.91 ms | 26.92 ms | 1.12× | 26% |
| Q8_0 | 45.3 | 22.09 ms | 28.44 ms | 1.06× | 50% |
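Tokens/sec and mean per-token latency are two views of the same measurement: the table's numbers are consistent with tok/s ≈ 1000 / mean latency in ms, which also yields the speedup column:

```python
def tokens_per_sec(mean_latency_ms):
    # One token is emitted every mean_latency_ms milliseconds.
    return 1000.0 / mean_latency_ms

base = tokens_per_sec(23.42)   # float16, ~42.7 tok/s
q4 = tokens_per_sec(20.91)     # Q4_K_M, ~47.8 tok/s
speedup = q4 / base            # ~1.12x, matching the table
```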
Full machine-readable results: `benchmark_results.json` · `benchmark_results.csv`
## Memory by Quantization Type
| Quant | Bits/Weight | File (GB) | Peak RAM (GB) | Fits 8 GB | Fits 16 GB |
|---|---|---|---|---|---|
| float32 | 32.0 | 33.5 | 40.2 | ❌ | ❌ |
| float16 | 16.0 | 16.8 | 20.1 | ❌ | ❌ |
| Q8_0 | 8.5 | 8.9 | 10.7 | ❌ | ✅ |
| Q6_K | 6.0 | 6.3 | 7.5 | ✅ | ✅ |
| Q5_K_M | 5.5 | 5.8 | 6.9 | ✅ | ✅ |
| Q4_K_M (recommended) | 4.5 | 4.7 | 5.7 | ✅ | ✅ |
| Q3_K_M | 3.5 | 3.7 | 4.4 | ✅ | ✅ |
| Q2_K | 2.5 | 2.6 | 3.2 | ✅ | ✅ |
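The table's numbers are consistent with a simple sizing model: file size ≈ params × bits-per-weight / 8 (in GiB), with peak RAM about 20% higher for KV-cache and runtime buffers. A minimal sketch in the spirit of `memory_estimator.py` (the 1.20 overhead factor is inferred from the table, not taken from the script):

```python
def estimate(params_b, bits_per_weight, overhead=1.20):
    """Rough GGUF sizing for a model with params_b billion parameters.

    overhead=1.20 is an assumption fitted to the table above; the real
    memory_estimator.py may use a more detailed model.
    """
    file_gib = params_b * 1e9 * bits_per_weight / 8 / 2**30
    peak_gib = file_gib * overhead
    return file_gib, peak_gib

# Q4_K_M at 4.5 bits/weight on the 9B model: ~4.7 GiB file, ~5.7 GiB peak.
file_gib, peak_gib = estimate(9.0, 4.5)
```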
## Reproduce Locally
```bash
git clone https://github.com/dakshjain-1616/acervo-extractor-quant
cd acervo-extractor-quant
pip install -r requirements.txt

# Full quantization pipeline (requires ~20 GB disk)
python quantize.py --model SandyVeliz/acervo-extractor-qwen3.5-9b

# Dry-run benchmark (no download needed)
python scripts/demo.py --dry-run --export-csv

# Estimate RAM for your hardware
python memory_estimator.py --params 9.0
```
## Files in This Repo
| File | Description |
|---|---|
| `acervo-extractor-qwen3.5-9b-Q4_K_M.gguf` | Quantized model (uploaded separately; large file) |
| `acervo-extractor-qwen3.5-9b-Q8_0.gguf` | Higher-quality quantized model (optional) |
| `benchmark_results.json` | Full benchmark results (machine-readable) |
| `benchmark_results.csv` | Benchmark results (CSV) |
| `quantization_report.md` | Detailed Markdown benchmark report |
| `assets/infographic_overview.png` | Performance overview chart |
| `assets/infographic_memory.png` | Memory requirements chart |
| `assets/infographic_pipeline.png` | Pipeline architecture diagram |
| `assets/infographic_tradeoff.png` | Quality vs speed vs size tradeoff |
## License
MIT – see LICENSE
## Model Tree
daksh-neo/acervo-extractor-qwen3.5-9b-GGUF derives from base model Qwen/Qwen3.5-9B-Base.