# acervo-extractor-qwen3.5-9b - GGUF Q4_K_M (4.7 GB)

*Made autonomously using NEO, your autonomous AI Agent*

GGUF-quantized version of SandyVeliz/acervo-extractor-qwen3.5-9b, a 9B document-extraction model fine-tuned for structured data parsing (invoices, contracts, financial reports). Quantized to Q4_K_M with llama.cpp: it runs in 8 GB of RAM with only a +6% perplexity increase and 12% faster inference than the float16 original.


## Performance at a Glance

![Quantization overview](assets/infographic_overview.png)

| Variant | File Size | Peak RAM | Speed | Perplexity Δ |
|---|---|---|---|---|
| float16 (original) | ~18 GB | 20 GB | 42.7 tok/s | baseline |
| **Q4_K_M** (this repo) | ~4.7 GB | 5.7 GB | 47.8 tok/s (+12%) | +6% |
| Q8_0 | ~9.5 GB | 10.7 GB | 45.3 tok/s (+6%) | +1% |

### Quality vs Speed Tradeoff

![Quality vs speed vs size tradeoff](assets/infographic_tradeoff.png)

Each bubble represents a quantization tier; bubble size is file size on disk. The ideal region is the bottom-right (low perplexity, high speed). Q4_K_M sits at the sweet spot: a significant size reduction with minimal quality loss.


## Memory Requirements

![Memory requirements chart](assets/infographic_memory.png)

Q4_K_M is the recommended tier: at ~5.7 GB peak RAM it fits comfortably within 8 GB and leaves headroom for the OS and context cache. Q8_0 peaks at ~11 GB, so plan on a 16 GB machine; float16 needs ~20 GB.


## Pipeline Architecture

![Pipeline architecture diagram](assets/infographic_pipeline.png)

The quantization pipeline:

1. `quantize.py` - downloads the base model from Hugging Face, builds llama.cpp, converts to GGUF
2. `benchmark.py` - measures perplexity, tokens/sec, and per-token latency stats
3. `memory_estimator.py` - predicts peak RAM/VRAM for any `(model_size, quant_type)` pair
4. `compare.py` - benchmarks multiple models side-by-side in one run
5. `scripts/demo.py` - orchestrates everything end-to-end
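The per-token latency stats that `benchmark.py` reports (step 2) boil a list of timings down to a mean, a tail percentile, and throughput. A minimal sketch of that reduction (illustrative only, not the repo's actual code):

```python
import math
import statistics

def latency_stats(latencies_ms: list[float]) -> dict:
    """Summarize per-token latencies into mean, P95, and tokens/sec."""
    ordered = sorted(latencies_ms)
    # Nearest-rank P95: the smallest sample covering 95% of observations.
    p95 = ordered[math.ceil(0.95 * len(ordered)) - 1]
    mean = statistics.mean(ordered)
    return {"mean_ms": mean, "p95_ms": p95, "tokens_per_sec": 1000.0 / mean}

stats = latency_stats([19.8, 20.1, 20.5, 20.9, 21.3, 22.0, 27.1])
print(stats)
```

Note that tokens/sec is just the reciprocal of mean per-token latency, which is why the two columns in the benchmark tables below move together.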

## How to Use

### With llama-cpp-python

```python
from llama_cpp import Llama

llm = Llama(
    model_path="acervo-extractor-qwen3.5-9b-Q4_K_M.gguf",
    n_ctx=2048,    # context window; raise for longer documents
    n_threads=8,   # set to your physical core count
)

output = llm(
    "Extract the key financial metrics from the following document:\n\n[document text here]",
    max_tokens=256,
    temperature=0.1,  # low temperature keeps extraction near-deterministic
)
print(output["choices"][0]["text"])
```

### With llama.cpp CLI

```bash
./llama-cli -m acervo-extractor-qwen3.5-9b-Q4_K_M.gguf \
  -p "Parse the following invoice and return structured JSON:" \
  -n 256 --temp 0.1
```

### With Ollama

```bash
ollama run hf.co/[your-username]/acervo-extractor-qwen3.5-9b-gguf
```

## Benchmark Results

> **Note:** Results below are from a mock / dry-run benchmark (synthetic data matching llama.cpp GGUF profiles). Real measurements on the downloaded model will vary slightly.

### Perplexity

| Variant | Perplexity | Δ vs float16 | % Change |
|---|---|---|---|
| float16 | 18.4321 | baseline | baseline |
| Q4_K_M | 19.5380 | +1.1059 | +6.00% |
| Q8_0 | 18.6164 | +0.1843 | +1.00% |

### Speed & Latency

| Variant | Tokens/sec | Mean latency | P95 latency | Speedup | Size (% of float16) |
|---|---|---|---|---|---|
| float16 | 42.7 | 23.42 ms | 30.15 ms | baseline | 100% |
| Q4_K_M | 47.8 | 20.91 ms | 26.92 ms | 1.12× | 26% |
| Q8_0 | 45.3 | 22.09 ms | 28.44 ms | 1.06× | 50% |
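The speed columns are internally consistent: tokens/sec is the reciprocal of mean per-token latency, and speedup is the ratio of throughputs. A quick sanity check against the table above:

```python
# Mean per-token latency (ms) per variant, taken from the table above.
latency_ms = {"float16": 23.42, "Q4_K_M": 20.91, "Q8_0": 22.09}

# Throughput is 1000 ms divided by the mean per-token latency.
tokens_per_sec = {k: 1000.0 / v for k, v in latency_ms.items()}

# Speedup is each variant's throughput relative to the float16 baseline.
speedup = {k: tokens_per_sec[k] / tokens_per_sec["float16"] for k in latency_ms}

for name in latency_ms:
    print(f"{name}: {tokens_per_sec[name]:.1f} tok/s, {speedup[name]:.2f}x")
```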

Full machine-readable results: [benchmark_results.json](benchmark_results.json) · [benchmark_results.csv](benchmark_results.csv)


### Memory by Quantization Type

| Quant | Bits/Weight | File (GB) | Peak RAM (GB) | Fits 8 GB | Fits 16 GB |
|---|---|---|---|---|---|
| float32 | 32.0 | 33.5 | 40.2 | ✗ | ✗ |
| float16 | 16.0 | 16.8 | 20.1 | ✗ | ✗ |
| Q8_0 | 8.5 | 8.9 | 10.7 | ✗ | ✓ |
| Q6_K | 6.0 | 6.3 | 7.5 | ✓ | ✓ |
| Q5_K_M | 5.5 | 5.8 | 6.9 | ✓ | ✓ |
| **Q4_K_M** (this repo) | 4.5 | 4.7 | 5.7 | ✓ | ✓ |
| Q3_K_M | 3.5 | 3.7 | 4.4 | ✓ | ✓ |
| Q2_K | 2.5 | 2.6 | 3.2 | ✓ | ✓ |
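The table follows a simple model: file size is roughly parameters × bits-per-weight, and peak RAM is the file size plus a fixed overhead for context and scratch buffers. A sketch of that estimate (the ~20% overhead factor is an assumption, not taken from `memory_estimator.py`, but it reproduces the numbers above):

```python
def estimate_memory(params_b: float, bits_per_weight: float) -> tuple[float, float]:
    """Estimate (file_gib, peak_ram_gib) for a model of params_b billion parameters."""
    file_bytes = params_b * 1e9 * bits_per_weight / 8  # weights only
    file_gib = file_bytes / 2**30
    peak_gib = file_gib * 1.2  # assumed ~20% overhead: KV cache, activations
    return file_gib, peak_gib

for quant, bits in [("Q4_K_M", 4.5), ("Q8_0", 8.5), ("float16", 16.0)]:
    file_gib, peak_gib = estimate_memory(9.0, bits)
    fits = "yes" if peak_gib <= 8.0 else "no"
    print(f"{quant}: {file_gib:.1f} GB file, {peak_gib:.1f} GB peak, fits 8 GB: {fits}")
```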

## Reproduce Locally

```bash
git clone https://github.com/dakshjain-1616/acervo-extractor-quant
cd acervo-extractor-quant
pip install -r requirements.txt

# Full quantization pipeline (requires ~20 GB disk)
python quantize.py --model SandyVeliz/acervo-extractor-qwen3.5-9b

# Dry-run benchmark (no download needed)
python scripts/demo.py --dry-run --export-csv

# Estimate RAM for your hardware
python memory_estimator.py --params 9.0
```

## Files in This Repo

| File | Description |
|---|---|
| `acervo-extractor-qwen3.5-9b-Q4_K_M.gguf` | Quantized model (upload separately; large file) |
| `acervo-extractor-qwen3.5-9b-Q8_0.gguf` | Higher-quality quantized model (optional) |
| `benchmark_results.json` | Full benchmark results (machine-readable) |
| `benchmark_results.csv` | Benchmark results (CSV) |
| `quantization_report.md` | Detailed Markdown benchmark report |
| `assets/infographic_overview.png` | Performance overview chart |
| `assets/infographic_memory.png` | Memory requirements chart |
| `assets/infographic_pipeline.png` | Pipeline architecture diagram |
| `assets/infographic_tradeoff.png` | Quality vs speed vs size tradeoff |

## License

MIT. See [LICENSE](LICENSE).



## Model Tree

daksh-neo/acervo-extractor-qwen3.5-9b-GGUF is quantized from SandyVeliz/acervo-extractor-qwen3.5-9b, which is fine-tuned from Qwen/Qwen3.5-9B.