π₯ LatencySurgeon
Give your model a 40% speedup in 3 lines of code.
π€ Built autonomously using NEO β Your Autonomous AI Agent
LatencySurgeon performs Tucker decomposition on transformer attention layers β replacing large weight matrices with efficient low-rank factorizations to achieve up to 40% inference speedup with minimal quality loss. Works on CPU, no special kernels needed.
π©Ί The Problem
Every transformer model β GPT, LLaMA, BERT, Mistral β is dominated by attention layers. Each attention layer contains four large weight matrices (Q, K, V, O projections). For GPT-2 medium, that's 24 layers Γ 4 matrices = 96 large Linear operations executed on every single forward pass.
These matrices are massively over-parameterized for most inputs. A [768Γ768] weight matrix has 589,824 parameters, but research consistently shows its effective rank β the number of dimensions that actually carry meaningful signal β is often below 64. The other 500,000+ parameters are redundancy baked in during training.
The real-world consequence:
- You're running a 7B model locally and it's too slow to use interactively
- You're paying for GPU inference and want to cut compute costs without retraining
- You're deploying on CPU (edge, laptop, CI) where quantization gives no FLOP savings
- You want to fine-tune a compressed model β quantization makes that impossible
Traditional solutions only go so far:
| Approach | What it does | CPU speedup? | Fine-tune after? | No calibration data? |
|---|---|---|---|---|
| GPTQ / AWQ | Reduce weight precision to int4 | β (needs kernels) | β | β |
| BitsAndBytes | int8 weights | β partial | β | β |
| Pruning | Zero out weights | β (sparse) | β | β |
| LatencySurgeon | Reduce matrix rank | ββ | β | β |
Quantization keeps the matrix the same size β it just uses fewer bits per number. The matrix-vector multiply still touches every element. LatencySurgeon actually shrinks the matrix.
π‘ The Solution β Tucker Decomposition
LatencySurgeon identifies the real information content of each attention weight matrix using Singular Value Decomposition (SVD), then replaces the original matrix with two smaller ones that approximate it:
Original: x β Linear(768, 768) β y
W [768Γ768]
589,824 multiplications
After surgery: x β Linear(768, 64) β Linear(64, 768) β y
A [768Γ64] B [64Γ768]
49,152 mults 49,152 mults
= 98,304 total (β 83% fewer ops)
Why this actually works:
The SVD of W produces singular values Οβ β₯ Οβ β₯ ... β₯ Οβββ. In trained attention layers, the top 32β64 singular values capture 95%+ of the weight matrix's "energy". Everything below rank ~64 is near-zero signal. LatencySurgeon keeps only the top-r singular values/vectors and throws the rest away β with a controlled, measurable quality tradeoff.
Why it's different from other compression:
- Not quantization β the weights stay in float32, just in smaller matrices. This means you can fine-tune the compressed model normally.
- Not pruning β there's no sparse structure to manage. The result is two dense Linear layers β every framework runs them efficiently without special kernels.
- CPU-native speedup β fewer FLOPs = faster on any hardware. No CUDA-specific int4 dequant kernels needed.
- Composable β stack Tucker + int8 quantization for up to 1.65Γ combined speedup.
- Reversible β you have the original model. If quality drops too much, tune up the rank.
The rank controls the quality/speed tradeoff:
rank=16 β 85% faster, 6.5% quality drop (aggressive)
rank=32 β 62% faster, 3.0% quality drop
rank=64 β 40% faster, 1.2% quality drop β
recommended
rank=128 β 20% faster, 0.5% quality drop (conservative)
LatencySurgeon's auto_tune_rank command binary-searches this space automatically, finding the smallest rank that keeps your quality loss within a threshold you define.
π Speedup vs Other Methods
π¬ How Tucker Decomposition Works
π₯ Surgery Pipeline
π Installation
# From PyPI
pip install latency-surgeon
# From source
git clone https://github.com/dakshjain-1616/latency-surgeon.git
cd latency-surgeon
pip install -e ".[all]"
# Minimal install (no report/HF extras)
pip install torch transformers rich typer scipy
pip install -e .
β¨ Quickstart β 3 Lines
from transformers import AutoModelForCausalLM
from latency_surgeon.core.patcher import create_surgery_manifest, apply_surgery
model = AutoModelForCausalLM.from_pretrained("gpt2")
manifest = create_surgery_manifest(model, "gpt2")
model = apply_surgery(model, manifest, rank=64)
# Done β 40% faster, same interface
Run the full demo:
python examples/quickstart.py
Expected output:
Loading GPT-2 (small, CPU-safe demo)...
π Surgery Plan:
Attention targets found : 24
Total attention params : 7,077,888
β± Benchmarking baseline (20 runs)...
Baseline β 100.3 tok/s | P50: 119.82ms
β
Quickstart complete!
π₯οΈ CLI β All Commands
Analyse a model (scan attention layers, no surgery)
latency-surgeon analyse gpt2
latency-surgeon analyse meta-llama/Llama-2-7b-hf
latency-surgeon analyse bert-base-uncased
Benchmark baseline (before surgery)
# Basic benchmark β 50 runs
latency-surgeon benchmark gpt2 --runs 50
# Save results to JSON for later comparison
latency-surgeon benchmark gpt2 --runs 100 --output baseline.json
# Specify device
latency-surgeon benchmark gpt2 --runs 50 --device cuda
Apply Tucker surgery
# Apply surgery with rank 64 (recommended sweet spot)
latency-surgeon surgery gpt2 --rank 64 --output ./compressed_model/
# Lower rank = more aggressive compression
latency-surgeon surgery gpt2 --rank 32 --output ./compressed_model_r32/
# Surgery + immediate benchmark
latency-surgeon surgery gpt2 --rank 64 --benchmark --runs 50
Auto-tune rank (finds optimal rank automatically)
# Find smallest rank that keeps perplexity within 5% of baseline
latency-surgeon tune gpt2 --target-perplexity-delta 0.05
# Tighter quality constraint
latency-surgeon tune gpt2 --target-perplexity-delta 0.02 --output ./tuned_model/
# Custom rank search range
latency-surgeon tune gpt2 --target-perplexity-delta 0.05 --min-rank 8 --max-rank 128
Generate HTML report
# Generate dark surgical-themed HTML report with charts
latency-surgeon report \
--before baseline.json \
--after after.json \
--model gpt2 \
--rank 64 \
--output latency_report.html
# Open report
open latency_report.html # macOS
xdg-open latency_report.html # Linux
π¦ Python API β Full Example
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from latency_surgeon.core.patcher import create_surgery_manifest, apply_surgery
from latency_surgeon.core.benchmarker import benchmark_model, compare_benchmarks
from latency_surgeon.report.html_report import generate_report
# 1. Load model
model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model.eval()
dummy_input = torch.randint(0, 50257, (1, 32))
# 2. Benchmark baseline
print("Benchmarking baseline...")
before = benchmark_model(model, dummy_input, num_runs=50)
print(before.summary())
# 3. Analyse what will be compressed
manifest = create_surgery_manifest(model, "gpt2")
print(f"Targets: {len(manifest.attention_targets)} attention layers")
# 4. Apply surgery
model = apply_surgery(model, manifest, rank=64)
# 5. Benchmark after
print("Benchmarking after surgery...")
after = benchmark_model(model, dummy_input, num_runs=50)
print(after.summary())
# 6. Compare
comparison = compare_benchmarks(before, after)
print(f"Speedup: {comparison['speedup_factor']:.2f}Γ")
print(f"Latency reduction: {comparison['latency_reduction_percent']:.1f}%")
# 7. Generate HTML report
path = generate_report(before.to_dict(), after.to_dict(), "gpt2", rank=64)
print(f"Report saved: {path}")
π― Auto-Rank Tuning API
from latency_surgeon.core.rank_tuner import auto_tune_rank
# Automatically binary-searches for the best rank
best_rank, result = auto_tune_rank(
model,
model_name="gpt2",
target_delta=0.05, # max 5% perplexity increase allowed
rank_range=(8, 128),
calibration_samples=50, # wikitext-2 samples for perplexity eval
)
print(f"Optimal rank: {best_rank}")
print(f"Perplexity delta: {result.perplexity_delta:.3f}")
print(f"Speedup: {result.speedup:.2f}Γ")
π§ͺ Running Tests
# Install dev dependencies
pip install -e ".[dev]"
# Run all tests
pytest tests/ -v
# Run specific test file
pytest tests/test_tucker.py -v
pytest tests/test_patcher.py -v
pytest tests/test_benchmarker.py -v
# Run with coverage
pytest tests/ --cov=latency_surgeon --cov-report=term-missing
π€ Export to HuggingFace Hub
# After applying surgery and saving the model locally:
python hf_export/push_to_hub.py \
--model-path ./compressed_model \
--repo-id your-username/gpt2-tucker-r64 \
--original-model gpt2 \
--rank 64 \
--speedup 1.4 \
--perplexity-delta 0.012
# Or via Python API
from hf_export.push_to_hub import push_to_hub
push_to_hub(
model_path="./compressed_model",
repo_id="your-username/gpt2-tucker-r64",
original_model="gpt2",
rank=64,
token="hf_...", # or set HF_TOKEN env var
)
ποΈ Project Structure
latency_surgeon/
βββ core/
β βββ patcher.py # Surgery manifest & attention layer detection
β βββ benchmarker.py # Latency profiling β p50/p95/p99, Rich UI
β βββ rank_tuner.py # Binary search for optimal Tucker rank
βββ tucker.py # SVD-based Tucker/low-rank decomposition
βββ attention_replace.py # Swap attention weights with Tucker pairs
βββ hf_integration.py # HuggingFace Hub helpers
βββ cli.py # Typer CLI (latency-surgeon command)
βββ report/
βββ html_report.py # Dark surgical-themed HTML + Canvas charts
examples/
βββ quickstart.py # 3-minute end-to-end demo
hf_export/
βββ push_to_hub.py # Upload compressed model to HF Hub
βββ config.json # Export metadata schema
infographics/
βββ speedup_chart.svg # Bar chart β Tucker vs GPTQ/AWQ/BnB
βββ tucker_decomposition.svg # Matrix factorization diagram
βββ surgery_pipeline.svg # 4-step surgery flow
tests/
βββ test_tucker.py # Decomposition + reconstruction tests
βββ test_patcher.py # Surgery manifest tests
βββ test_benchmarker.py # Benchmarking pipeline tests
π¬ Supported Model Families
| Family | Detected Layers | Auto-detected |
|---|---|---|
| GPT-2 / GPT-J / GPT-Neo | c_attn, c_proj |
β |
| LLaMA / Mistral / Phi | q_proj, k_proj, v_proj, o_proj |
β |
| BERT / RoBERTa / DistilBERT | query, key, value, dense |
β |
| T5 / Flan-T5 | q, k, v, o |
β |
| Generic transformers | Pattern match by layer name | β |
π Links
| GitHub | https://github.com/dakshjain-1616/latency-surgeon |
| HuggingFace | https://huggingface.co/daksh-neo/latency-surgeon |
| PyPI | pip install latency-surgeon |
| NEO | https://heyneo.com |
π License
MIT β see LICENSE
- Downloads last month
- 14