πŸ₯ LatencySurgeon

Give your model a 40% speedup in 3 lines of code.

πŸ€– Built autonomously using NEO β€” Your Autonomous AI Agent

NEO VS Code Extension Python License: MIT HuggingFace GitHub

LatencySurgeon performs Tucker decomposition on transformer attention layers β€” replacing large weight matrices with efficient low-rank factorizations to achieve up to 40% inference speedup with minimal quality loss. Works on CPU, no special kernels needed.


🩺 The Problem

Every transformer model β€” GPT, LLaMA, BERT, Mistral β€” is dominated by attention layers. Each attention layer contains four large weight matrices (Q, K, V, O projections). For GPT-2 medium, that's 24 layers Γ— 4 matrices = 96 large Linear operations executed on every single forward pass.

These matrices are massively over-parameterized for most inputs. A [768Γ—768] weight matrix has 589,824 parameters, but research consistently shows its effective rank β€” the number of dimensions that actually carry meaningful signal β€” is often below 64. The other 500,000+ parameters are redundancy baked in during training.

The real-world consequence:

  • You're running a 7B model locally and it's too slow to use interactively
  • You're paying for GPU inference and want to cut compute costs without retraining
  • You're deploying on CPU (edge, laptop, CI) where quantization gives no FLOP savings
  • You want to fine-tune a compressed model β€” quantization makes that impossible

Traditional solutions only go so far:

Approach What it does CPU speedup? Fine-tune after? No calibration data?
GPTQ / AWQ Reduce weight precision to int4 βœ— (needs kernels) βœ— βœ—
BitsAndBytes int8 weights βœ“ partial βœ— βœ“
Pruning Zero out weights βœ“ (sparse) βœ“ βœ—
LatencySurgeon Reduce matrix rank βœ“βœ“ βœ“ βœ“

Quantization keeps the matrix the same size β€” it just uses fewer bits per number. The matrix-vector multiply still touches every element. LatencySurgeon actually shrinks the matrix.


πŸ’‘ The Solution β€” Tucker Decomposition

LatencySurgeon identifies the real information content of each attention weight matrix using Singular Value Decomposition (SVD), then replaces the original matrix with two smaller ones that approximate it:

Original:   x  β†’  Linear(768, 768)  β†’  y
                    W  [768Γ—768]
                  589,824 multiplications

After surgery:  x  β†’  Linear(768, 64)  β†’  Linear(64, 768)  β†’  y
                         A [768Γ—64]            B [64Γ—768]
                       49,152 mults          49,152 mults
                              = 98,304 total  (↓ 83% fewer ops)

Why this actually works:

The SVD of W produces singular values σ₁ β‰₯ Οƒβ‚‚ β‰₯ ... β‰₯ Οƒβ‚‡β‚†β‚ˆ. In trained attention layers, the top 32–64 singular values capture 95%+ of the weight matrix's "energy". Everything below rank ~64 is near-zero signal. LatencySurgeon keeps only the top-r singular values/vectors and throws the rest away β€” with a controlled, measurable quality tradeoff.

Why it's different from other compression:

  • Not quantization β€” the weights stay in float32, just in smaller matrices. This means you can fine-tune the compressed model normally.
  • Not pruning β€” there's no sparse structure to manage. The result is two dense Linear layers β€” every framework runs them efficiently without special kernels.
  • CPU-native speedup β€” fewer FLOPs = faster on any hardware. No CUDA-specific int4 dequant kernels needed.
  • Composable β€” stack Tucker + int8 quantization for up to 1.65Γ— combined speedup.
  • Reversible β€” you have the original model. If quality drops too much, tune up the rank.

The rank controls the quality/speed tradeoff:

rank=16  β†’  85% faster,  6.5% quality drop   (aggressive)
rank=32  β†’  62% faster,  3.0% quality drop
rank=64  β†’  40% faster,  1.2% quality drop   β˜… recommended
rank=128 β†’  20% faster,  0.5% quality drop   (conservative)

LatencySurgeon's auto_tune_rank command binary-searches this space automatically, finding the smallest rank that keeps your quality loss within a threshold you define.


πŸ“Š Speedup vs Other Methods

Speedup comparison chart

πŸ”¬ How Tucker Decomposition Works

Tucker decomposition visual

πŸ₯ Surgery Pipeline

Surgery pipeline

πŸš€ Installation

# From PyPI
pip install latency-surgeon

# From source
git clone https://github.com/dakshjain-1616/latency-surgeon.git
cd latency-surgeon
pip install -e ".[all]"

# Minimal install (no report/HF extras)
pip install torch transformers rich typer scipy
pip install -e .

✨ Quickstart β€” 3 Lines

from transformers import AutoModelForCausalLM
from latency_surgeon.core.patcher import create_surgery_manifest, apply_surgery

model = AutoModelForCausalLM.from_pretrained("gpt2")
manifest = create_surgery_manifest(model, "gpt2")
model = apply_surgery(model, manifest, rank=64)
# Done β€” 40% faster, same interface

Run the full demo:

python examples/quickstart.py

Expected output:

Loading GPT-2 (small, CPU-safe demo)...
πŸ” Surgery Plan:
   Attention targets found : 24
   Total attention params  : 7,077,888
⏱  Benchmarking baseline (20 runs)...
   Baseline  β†’ 100.3 tok/s | P50: 119.82ms
βœ… Quickstart complete!

πŸ–₯️ CLI β€” All Commands

Analyse a model (scan attention layers, no surgery)

latency-surgeon analyse gpt2
latency-surgeon analyse meta-llama/Llama-2-7b-hf
latency-surgeon analyse bert-base-uncased

Benchmark baseline (before surgery)

# Basic benchmark β€” 50 runs
latency-surgeon benchmark gpt2 --runs 50

# Save results to JSON for later comparison
latency-surgeon benchmark gpt2 --runs 100 --output baseline.json

# Specify device
latency-surgeon benchmark gpt2 --runs 50 --device cuda

Apply Tucker surgery

# Apply surgery with rank 64 (recommended sweet spot)
latency-surgeon surgery gpt2 --rank 64 --output ./compressed_model/

# Lower rank = more aggressive compression
latency-surgeon surgery gpt2 --rank 32 --output ./compressed_model_r32/

# Surgery + immediate benchmark
latency-surgeon surgery gpt2 --rank 64 --benchmark --runs 50

Auto-tune rank (finds optimal rank automatically)

# Find smallest rank that keeps perplexity within 5% of baseline
latency-surgeon tune gpt2 --target-perplexity-delta 0.05

# Tighter quality constraint
latency-surgeon tune gpt2 --target-perplexity-delta 0.02 --output ./tuned_model/

# Custom rank search range
latency-surgeon tune gpt2 --target-perplexity-delta 0.05 --min-rank 8 --max-rank 128

Generate HTML report

# Generate dark surgical-themed HTML report with charts
latency-surgeon report \
  --before baseline.json \
  --after after.json \
  --model gpt2 \
  --rank 64 \
  --output latency_report.html

# Open report
open latency_report.html        # macOS
xdg-open latency_report.html    # Linux

πŸ“¦ Python API β€” Full Example

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from latency_surgeon.core.patcher import create_surgery_manifest, apply_surgery
from latency_surgeon.core.benchmarker import benchmark_model, compare_benchmarks
from latency_surgeon.report.html_report import generate_report

# 1. Load model
model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model.eval()

dummy_input = torch.randint(0, 50257, (1, 32))

# 2. Benchmark baseline
print("Benchmarking baseline...")
before = benchmark_model(model, dummy_input, num_runs=50)
print(before.summary())

# 3. Analyse what will be compressed
manifest = create_surgery_manifest(model, "gpt2")
print(f"Targets: {len(manifest.attention_targets)} attention layers")

# 4. Apply surgery
model = apply_surgery(model, manifest, rank=64)

# 5. Benchmark after
print("Benchmarking after surgery...")
after = benchmark_model(model, dummy_input, num_runs=50)
print(after.summary())

# 6. Compare
comparison = compare_benchmarks(before, after)
print(f"Speedup:          {comparison['speedup_factor']:.2f}Γ—")
print(f"Latency reduction: {comparison['latency_reduction_percent']:.1f}%")

# 7. Generate HTML report
path = generate_report(before.to_dict(), after.to_dict(), "gpt2", rank=64)
print(f"Report saved: {path}")

🎯 Auto-Rank Tuning API

from latency_surgeon.core.rank_tuner import auto_tune_rank

# Automatically binary-searches for the best rank
best_rank, result = auto_tune_rank(
    model,
    model_name="gpt2",
    target_delta=0.05,        # max 5% perplexity increase allowed
    rank_range=(8, 128),
    calibration_samples=50,   # wikitext-2 samples for perplexity eval
)
print(f"Optimal rank: {best_rank}")
print(f"Perplexity delta: {result.perplexity_delta:.3f}")
print(f"Speedup: {result.speedup:.2f}Γ—")

πŸ§ͺ Running Tests

# Install dev dependencies
pip install -e ".[dev]"

# Run all tests
pytest tests/ -v

# Run specific test file
pytest tests/test_tucker.py -v
pytest tests/test_patcher.py -v
pytest tests/test_benchmarker.py -v

# Run with coverage
pytest tests/ --cov=latency_surgeon --cov-report=term-missing

πŸ“€ Export to HuggingFace Hub

# After applying surgery and saving the model locally:
python hf_export/push_to_hub.py \
  --model-path ./compressed_model \
  --repo-id your-username/gpt2-tucker-r64 \
  --original-model gpt2 \
  --rank 64 \
  --speedup 1.4 \
  --perplexity-delta 0.012
# Or via Python API
from hf_export.push_to_hub import push_to_hub

push_to_hub(
    model_path="./compressed_model",
    repo_id="your-username/gpt2-tucker-r64",
    original_model="gpt2",
    rank=64,
    token="hf_...",  # or set HF_TOKEN env var
)

πŸ—οΈ Project Structure

latency_surgeon/
β”œβ”€β”€ core/
β”‚   β”œβ”€β”€ patcher.py        # Surgery manifest & attention layer detection
β”‚   β”œβ”€β”€ benchmarker.py    # Latency profiling β€” p50/p95/p99, Rich UI
β”‚   └── rank_tuner.py     # Binary search for optimal Tucker rank
β”œβ”€β”€ tucker.py             # SVD-based Tucker/low-rank decomposition
β”œβ”€β”€ attention_replace.py  # Swap attention weights with Tucker pairs
β”œβ”€β”€ hf_integration.py     # HuggingFace Hub helpers
β”œβ”€β”€ cli.py                # Typer CLI (latency-surgeon command)
└── report/
    └── html_report.py    # Dark surgical-themed HTML + Canvas charts
examples/
└── quickstart.py         # 3-minute end-to-end demo
hf_export/
β”œβ”€β”€ push_to_hub.py        # Upload compressed model to HF Hub
└── config.json           # Export metadata schema
infographics/
β”œβ”€β”€ speedup_chart.svg     # Bar chart β€” Tucker vs GPTQ/AWQ/BnB
β”œβ”€β”€ tucker_decomposition.svg  # Matrix factorization diagram
└── surgery_pipeline.svg  # 4-step surgery flow
tests/
β”œβ”€β”€ test_tucker.py        # Decomposition + reconstruction tests
β”œβ”€β”€ test_patcher.py       # Surgery manifest tests
└── test_benchmarker.py   # Benchmarking pipeline tests

πŸ”¬ Supported Model Families

Family Detected Layers Auto-detected
GPT-2 / GPT-J / GPT-Neo c_attn, c_proj βœ…
LLaMA / Mistral / Phi q_proj, k_proj, v_proj, o_proj βœ…
BERT / RoBERTa / DistilBERT query, key, value, dense βœ…
T5 / Flan-T5 q, k, v, o βœ…
Generic transformers Pattern match by layer name βœ…

πŸ”— Links


πŸ“„ License

MIT β€” see LICENSE


πŸ₯ Built with NEO β€” Your Autonomous AI Agent
Downloads last month
14
Inference Providers NEW
This model isn't deployed by any Inference Provider. πŸ™‹ Ask for provider support