# acervo-extractor-qwen3.5-9b - GGUF Q4_K_M (4.7 GB)

*Made autonomously using NEO, your autonomous AI Agent*

GGUF-quantized version of SandyVeliz/acervo-extractor-qwen3.5-9b, a 9B document-extraction model fine-tuned for structured data parsing (invoices, contracts, financial reports). Quantized to Q4_K_M with llama.cpp: it runs in 8 GB of RAM with only a +6% perplexity increase and 12% faster inference than the float16 original.


## Performance at a Glance

![Quantization overview](assets/infographic_overview.png)

| Variant | File Size | Peak RAM | Speed | Perplexity Δ |
|---|---|---|---|---|
| float16 (original) | ~18 GB | 20 GB | 42.7 tok/s | baseline |
| **Q4_K_M** (this repo) | ~4.7 GB | 5.7 GB | 47.8 tok/s (+12%) | +6% |
| Q8_0 | ~9.5 GB | 10.7 GB | 45.3 tok/s (+6%) | +1% |

### Quality vs Speed Tradeoff

![Quality vs speed vs size tradeoff](assets/infographic_tradeoff.png)

Each bubble represents a quantization tier; bubble size is file size on disk. The ideal region is the bottom-right (low perplexity, high speed). Q4_K_M sits at the sweet spot: a significant size reduction with minimal quality loss.


## Memory Requirements

![Memory requirements chart](assets/infographic_memory.png)

Q4_K_M is the recommended tier: at ~5.7 GB peak RAM it fits comfortably within 8 GB and leaves headroom for the OS and context cache. Q8_0 peaks at ~11 GB, so plan on a 16 GB machine; float16 needs ~20 GB.


## Pipeline Architecture

![Pipeline architecture diagram](assets/infographic_pipeline.png)

The quantization pipeline:

1. `quantize.py` - downloads the base model from Hugging Face, builds llama.cpp, converts to GGUF
2. `benchmark.py` - measures perplexity, tokens/sec, and per-token latency stats
3. `memory_estimator.py` - predicts peak RAM/VRAM for any `(model_size, quant_type)` pair
4. `compare.py` - benchmarks multiple models side-by-side in one run
5. `scripts/demo.py` - orchestrates everything end-to-end
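The per-token latency stats that `benchmark.py` reports (step 2) boil a list of timings down to a mean, a tail percentile, and throughput. A minimal sketch of that reduction (illustrative only, not the repo's actual code):

```python
import math
import statistics

def latency_stats(latencies_ms: list[float]) -> dict:
    """Summarize per-token latencies into mean, P95, and tokens/sec."""
    ordered = sorted(latencies_ms)
    # Nearest-rank P95: the smallest sample covering 95% of observations.
    p95 = ordered[math.ceil(0.95 * len(ordered)) - 1]
    mean = statistics.mean(ordered)
    return {"mean_ms": mean, "p95_ms": p95, "tokens_per_sec": 1000.0 / mean}

stats = latency_stats([19.8, 20.1, 20.5, 20.9, 21.3, 22.0, 27.1])
print(stats)
```

Note that tokens/sec is just the reciprocal of mean per-token latency, which is why the two columns in the benchmark tables below move together.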

## How to Use

### With llama-cpp-python

```python
from llama_cpp import Llama

llm = Llama(
    model_path="acervo-extractor-qwen3.5-9b-Q4_K_M.gguf",
    n_ctx=2048,    # context window; raise for longer documents
    n_threads=8,   # set to your physical core count
)

output = llm(
    "Extract the key financial metrics from the following document:\n\n[document text here]",
    max_tokens=256,
    temperature=0.1,  # low temperature keeps extraction near-deterministic
)
print(output["choices"][0]["text"])
```

### With llama.cpp CLI

```bash
./llama-cli -m acervo-extractor-qwen3.5-9b-Q4_K_M.gguf \
  -p "Parse the following invoice and return structured JSON:" \
  -n 256 --temp 0.1
```

### With Ollama

```bash
ollama run hf.co/[your-username]/acervo-extractor-qwen3.5-9b-gguf
```

## Benchmark Results

> **Note:** Results below are from a mock / dry-run benchmark (synthetic data matching llama.cpp GGUF profiles). Real measurements on the downloaded model will vary slightly.

### Perplexity

| Variant | Perplexity | Δ vs float16 | % Change |
|---|---|---|---|
| float16 | 18.4321 | baseline | baseline |
| Q4_K_M | 19.5380 | +1.1059 | +6.00% |
| Q8_0 | 18.6164 | +0.1843 | +1.00% |

### Speed & Latency

| Variant | Tokens/sec | Mean latency | P95 latency | Speedup | Size (% of float16) |
|---|---|---|---|---|---|
| float16 | 42.7 | 23.42 ms | 30.15 ms | baseline | 100% |
| Q4_K_M | 47.8 | 20.91 ms | 26.92 ms | 1.12× | 26% |
| Q8_0 | 45.3 | 22.09 ms | 28.44 ms | 1.06× | 50% |
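The speed columns are internally consistent: tokens/sec is the reciprocal of mean per-token latency, and speedup is the ratio of throughputs. A quick sanity check against the table above:

```python
# Mean per-token latency (ms) per variant, taken from the table above.
latency_ms = {"float16": 23.42, "Q4_K_M": 20.91, "Q8_0": 22.09}

# Throughput is 1000 ms divided by the mean per-token latency.
tokens_per_sec = {k: 1000.0 / v for k, v in latency_ms.items()}

# Speedup is each variant's throughput relative to the float16 baseline.
speedup = {k: tokens_per_sec[k] / tokens_per_sec["float16"] for k in latency_ms}

for name in latency_ms:
    print(f"{name}: {tokens_per_sec[name]:.1f} tok/s, {speedup[name]:.2f}x")
```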

Full machine-readable results: [benchmark_results.json](benchmark_results.json) · [benchmark_results.csv](benchmark_results.csv)


### Memory by Quantization Type

| Quant | Bits/Weight | File (GB) | Peak RAM (GB) | Fits 8 GB | Fits 16 GB |
|---|---|---|---|---|---|
| float32 | 32.0 | 33.5 | 40.2 | ✗ | ✗ |
| float16 | 16.0 | 16.8 | 20.1 | ✗ | ✗ |
| Q8_0 | 8.5 | 8.9 | 10.7 | ✗ | ✓ |
| Q6_K | 6.0 | 6.3 | 7.5 | ✓ | ✓ |
| Q5_K_M | 5.5 | 5.8 | 6.9 | ✓ | ✓ |
| **Q4_K_M** (this repo) | 4.5 | 4.7 | 5.7 | ✓ | ✓ |
| Q3_K_M | 3.5 | 3.7 | 4.4 | ✓ | ✓ |
| Q2_K | 2.5 | 2.6 | 3.2 | ✓ | ✓ |
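The table follows a simple model: file size is roughly parameters × bits-per-weight, and peak RAM is the file size plus a fixed overhead for context and scratch buffers. A sketch of that estimate (the ~20% overhead factor is an assumption, not taken from `memory_estimator.py`, but it reproduces the numbers above):

```python
def estimate_memory(params_b: float, bits_per_weight: float) -> tuple[float, float]:
    """Estimate (file_gib, peak_ram_gib) for a model of params_b billion parameters."""
    file_bytes = params_b * 1e9 * bits_per_weight / 8  # weights only
    file_gib = file_bytes / 2**30
    peak_gib = file_gib * 1.2  # assumed ~20% overhead: KV cache, activations
    return file_gib, peak_gib

for quant, bits in [("Q4_K_M", 4.5), ("Q8_0", 8.5), ("float16", 16.0)]:
    file_gib, peak_gib = estimate_memory(9.0, bits)
    fits = "yes" if peak_gib <= 8.0 else "no"
    print(f"{quant}: {file_gib:.1f} GB file, {peak_gib:.1f} GB peak, fits 8 GB: {fits}")
```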

## Reproduce Locally

```bash
git clone https://github.com/dakshjain-1616/acervo-extractor-quant
cd acervo-extractor-quant
pip install -r requirements.txt

# Full quantization pipeline (requires ~20 GB disk)
python quantize.py --model SandyVeliz/acervo-extractor-qwen3.5-9b

# Dry-run benchmark (no download needed)
python scripts/demo.py --dry-run --export-csv

# Estimate RAM for your hardware
python memory_estimator.py --params 9.0
```

## Files in This Repo

| File | Description |
|---|---|
| `acervo-extractor-qwen3.5-9b-Q4_K_M.gguf` | Quantized model (upload separately; large file) |
| `acervo-extractor-qwen3.5-9b-Q8_0.gguf` | Higher-quality quantized model (optional) |
| `benchmark_results.json` | Full benchmark results (machine-readable) |
| `benchmark_results.csv` | Benchmark results (CSV) |
| `quantization_report.md` | Detailed Markdown benchmark report |
| `assets/infographic_overview.png` | Performance overview chart |
| `assets/infographic_memory.png` | Memory requirements chart |
| `assets/infographic_pipeline.png` | Pipeline architecture diagram |
| `assets/infographic_tradeoff.png` | Quality vs speed vs size tradeoff |

## License

MIT. See [LICENSE](LICENSE).



## Model Tree

daksh-neo/acervo-extractor-qwen3.5-9b-GGUF is quantized from SandyVeliz/acervo-extractor-qwen3.5-9b, which is fine-tuned from Qwen/Qwen3.5-9B.