# Cortex: Modular Cognitive Plug-ins for Pretrained LLMs

> Surgically insert new cognitive capabilities into any pretrained transformer LLM, without retraining the base model.

Cortex is a framework for performing layer surgery on pretrained language models. It injects lightweight, composable modules into the transformer's residual stream via PyTorch hooks, adding capabilities that address fundamental LLM failure modes. The base model weights are frozen; only the Cortex modules (~3% parameter overhead) are trainable.
## Architecture Overview

```
┌──────────────────────────────────────────────────────────────┐
│                   Pretrained LLM (frozen)                    │
│                                                              │
│  ┌─────────┐    ┌─────────┐    ┌─────────┐    ┌─────────┐    │
│  │ Layer 0 │───▶│ Layer 1 │───▶│   ...   │───▶│ Layer N │──▶ │
│  └────┬────┘    └────┬────┘    └────┬────┘    └────┬────┘    │
│       │              │              │              │         │
│  ┌────▼────┐    ┌────▼────┐    ┌────▼────┐    ┌────▼────┐    │
│  │Adaptive │    │Backtrack│    │ Memory  │    │ Halluc  │    │
│  │ Depth   │    │  Head   │    │  Bank   │    │  Gate   │    │
│  │ (gate)  │    │(correct)│    │ (read/  │    │(suppress│    │
│  │         │    │         │    │  write) │    │ unsure) │    │
│  └─────────┘    └─────────┘    └─────────┘    └─────────┘    │
│       │              │              │                        │
│  ┌────▼────┐    ┌────▼────┐    ┌────▼────┐                   │
│  │Steering │    │ Pause   │    │   ...   │                   │
│  │ Vector  │    │ & Think │    │         │                   │
│  │ (steer) │    │(compute)│    │         │                   │
│  └─────────┘    └─────────┘    └─────────┘                   │
└──────────────────────────────────────────────────────────────┘
```
## The 6 Modules

### 1. MemoryBank: Persistent Episodic Memory

Failure mode: limited context window, lost long-range dependencies, no working memory.

Injects a learnable memory matrix M ∈ ℝ^{N×D} into the middle transformer layers. Hidden states read from memory via multi-head cross-attention and write back via LSTM-style gated updates. Memory persists across forward passes, enabling multi-turn memory and long-document reasoning.

- Injection: `POST_ATTENTION` (between attention and FFN)
- Mechanism: Cross-attention read → Output gate → LSTM-style write
- Based on: LM2: Large Memory Models (Kang et al. 2025), WISE (Xia et al. 2024)
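As a rough sketch of the read/write mechanism (a minimal illustration with hypothetical names, not the Cortex implementation): a learnable memory matrix is read via cross-attention, the result enters the residual stream through a zero-initialized projection, and a gated update writes pooled context back into memory.

```python
import torch
import torch.nn as nn

class TinyMemoryBank(nn.Module):
    """Illustrative sketch only: cross-attention read + gated write."""
    def __init__(self, hidden_dim: int, num_slots: int = 8):
        super().__init__()
        self.memory = nn.Parameter(torch.randn(num_slots, hidden_dim) * 0.02)
        self.read_attn = nn.MultiheadAttention(hidden_dim, num_heads=2, batch_first=True)
        self.out_proj = nn.Linear(hidden_dim, hidden_dim)
        nn.init.zeros_(self.out_proj.weight)  # zero-init: exact no-op at injection
        nn.init.zeros_(self.out_proj.bias)
        self.write_gate = nn.Linear(hidden_dim, hidden_dim)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        B = h.size(0)
        mem = self.memory.unsqueeze(0).expand(B, -1, -1)
        read, _ = self.read_attn(h, mem, mem)        # tokens read from memory slots
        pooled = h.mean(dim=1)                       # (B, D) summary of the sequence
        g = torch.sigmoid(self.write_gate(pooled)).mean(0)  # (D,) write gate
        with torch.no_grad():                        # gated, persistent write-back
            self.memory.mul_(1 - g).add_(g * pooled.mean(0))
        return h + self.out_proj(read)               # residual update

bank = TinyMemoryBank(16)
h = torch.randn(2, 5, 16)
out = bank(h)  # memory is updated, but the zero-init projection leaves h unchanged
```

Because `out_proj` starts at zero, the module is behaviorally invisible until trained, matching the zero-init design principle described below.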
### 2. HallucinationGate: Confidence-Based Suppression

Failure mode: hallucination, i.e. generating confident but factually wrong content.

A lightweight confidence probe reads the residual stream and outputs a per-token confidence score. When confidence is low, the gate suppresses the layer's residual update, pulling the representation back toward the safer prior. The model effectively learns to say "I don't know" at the representation level.

- Injection: `POST_FFN` (after the full transformer block)
- Mechanism: Confidence probe → Soft gate → Suppress uncertain updates
- Key insight: internal states contain more information about correctness than output distributions: I(Θ; K(X) | X) ≥ I(Y; K(X) | X) + Δ
- Based on: InternalInspector (Chen et al. 2024), The Map of Misbelief (2025)
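A minimal sketch of the gating idea (hypothetical names, assuming a sigmoid probe over the residual stream; not the Cortex code): low confidence interpolates the block's output back toward its input.

```python
import torch
import torch.nn as nn

class TinyConfidenceGate(nn.Module):
    """Illustrative sketch: a probe scores per-token confidence and
    softly scales the block's residual update accordingly."""
    def __init__(self, hidden_dim: int):
        super().__init__()
        self.probe = nn.Linear(hidden_dim, 1)
        # Large positive bias => gate starts near 1, so the module is
        # effectively a no-op at injection time.
        nn.init.zeros_(self.probe.weight)
        nn.init.constant_(self.probe.bias, 6.0)

    def forward(self, h_in: torch.Tensor, h_out: torch.Tensor) -> torch.Tensor:
        conf = torch.sigmoid(self.probe(h_out))   # (B, T, 1) per-token confidence
        # Low confidence pulls the output back toward the pre-block state.
        return conf * h_out + (1 - conf) * h_in

gate = TinyConfidenceGate(16)
h_in, h_out = torch.randn(2, 5, 16), torch.randn(2, 5, 16)
mixed = gate(h_in, h_out)  # starts ~= h_out; training can learn to suppress
```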
### 3. PauseAndThink: Latent Computation Tokens

Failure mode: fixed compute per token, shallow reasoning, limited "thinking time."

Injects K learnable "thinking" token embeddings that attend to all real tokens, perform computation, then compress their information back into the original sequence positions via gated cross-attention. Like chain-of-thought, but entirely in latent space: no extra output tokens are needed.

- Injection: `RESIDUAL_STREAM` (wraps the full block)
- Mechanism: Context-conditioned thinking tokens → Attention to real tokens → Gated compression back
- Based on: Pause Tokens (Goyal et al. 2023), Thoughtbubbles (2025)
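The latent-thinking loop can be sketched roughly as follows (illustrative only, with hypothetical names): thinking tokens gather context from the real tokens, the real tokens read the result back, and a zero-initialized gate keeps the whole thing a no-op at injection.

```python
import torch
import torch.nn as nn

class TinyPauseAndThink(nn.Module):
    """Illustrative sketch: K latent 'thinking' tokens attend to real tokens,
    then their result is folded back via a zero-init gated projection."""
    def __init__(self, hidden_dim: int, num_think_tokens: int = 4):
        super().__init__()
        self.think = nn.Parameter(torch.randn(num_think_tokens, hidden_dim) * 0.02)
        self.gather = nn.MultiheadAttention(hidden_dim, num_heads=2, batch_first=True)
        self.readback = nn.MultiheadAttention(hidden_dim, num_heads=2, batch_first=True)
        self.gate = nn.Linear(hidden_dim, hidden_dim)
        nn.init.zeros_(self.gate.weight)   # invisible until trained
        nn.init.zeros_(self.gate.bias)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        B = h.size(0)
        t = self.think.unsqueeze(0).expand(B, -1, -1)
        t, _ = self.gather(t, h, h)        # thinking tokens attend to real tokens
        back, _ = self.readback(h, t, t)   # real tokens read the latent thoughts
        return h + self.gate(back)         # gated compression into original positions

m = TinyPauseAndThink(16)
h = torch.randn(2, 6, 16)
out = m(h)  # sequence length is unchanged: no extra output tokens
```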
### 4. BacktrackHead: Learned Self-Correction

Failure mode: commitment to bad intermediate representations, no backtracking.

Monitors confidence across layers. When it detects a significant confidence drop (indicating the model went down a bad path), it applies a learned corrector network to steer the representation back toward a higher-confidence trajectory, effectively implementing architectural self-correction.

- Injection: `RESIDUAL_STREAM` (all layers)
- Mechanism: Per-layer confidence probe → Drop detection → Bottleneck corrector network
- Based on: GateSkip (2025), River-LLM (2025), Self-Correction (Welleck et al. 2022)
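The probe-then-correct loop might look roughly like this (an illustrative sketch with hypothetical names and an arbitrary drop threshold, not the Cortex code):

```python
import torch
import torch.nn as nn

class TinyBacktrack(nn.Module):
    """Illustrative sketch: probe confidence at each layer; on a sharp drop,
    apply a zero-init bottleneck corrector to the residual stream."""
    def __init__(self, hidden_dim: int, drop_threshold: float = 0.2):
        super().__init__()
        self.probe = nn.Linear(hidden_dim, 1)
        self.corrector = nn.Sequential(            # bottleneck corrector network
            nn.Linear(hidden_dim, hidden_dim // 4),
            nn.GELU(),
            nn.Linear(hidden_dim // 4, hidden_dim),
        )
        nn.init.zeros_(self.corrector[-1].weight)  # no-op until trained
        nn.init.zeros_(self.corrector[-1].bias)
        self.threshold = drop_threshold
        self.prev_conf = None

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        conf = torch.sigmoid(self.probe(h)).mean()  # scalar confidence this layer
        triggered = (self.prev_conf is not None
                     and (self.prev_conf - conf) > self.threshold)
        self.prev_conf = conf.detach()              # remember for the next layer
        return h + self.corrector(h) if triggered else h

bt = TinyBacktrack(16)
h = torch.randn(2, 5, 16)
out = bt(h)  # first layer: nothing to compare against, so no correction
```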
### 5. SteeringVector: Behavioral Control

Failure mode: behavioral inflexibility; no way to control style, truthfulness, or safety at runtime.

Maintains named "concept directions" in activation space. Directions can be extracted via contrastive activation analysis (RepE) or learned end-to-end. Multiple directions compose linearly with individual learnable weights, enabling runtime control of truthfulness, helpfulness, safety, and persona without retraining.

- Injection: `RESIDUAL_STREAM` (middle layers)
- Mechanism: `h_new = h + layer_scale × Σ(α_i × direction_i)`
- Based on: Representation Engineering (Zou et al. 2023)
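The linear composition rule can be written out directly (an illustrative sketch; `apply_steering` and the toy directions are hypothetical, not the Cortex API):

```python
import torch

def apply_steering(h, directions, alphas, layer_scale=1.0):
    """h_new = h + layer_scale * sum_i(alpha_i * direction_i)."""
    delta = sum(a * d for a, d in zip(alphas, directions))
    return h + layer_scale * delta  # broadcast over batch and sequence dims

# Two toy unit directions in a 4-dim activation space
h = torch.zeros(1, 3, 4)
truth = torch.tensor([1.0, 0.0, 0.0, 0.0])
style = torch.tensor([0.0, 1.0, 0.0, 0.0])

# Each position shifts by 2*truth - 1*style = [2, -1, 0, 0]
out = apply_steering(h, [truth, style], [2.0, -1.0])
```

Because the update is additive in the residual stream, direction strengths (`alpha_i`) can be changed at inference time without touching any weights.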
### 6. AdaptiveDepth: Dynamic Layer Skipping

Failure mode: fixed compute depth, wasted computation, overthinking (representation collapse).

Each layer gets a learned gate that decides per token whether to execute or skip. Easy tokens ("the") skip many layers; hard tokens (complex reasoning) use all of them. Includes budget regularization to target a desired compute fraction.

- Injection: `POST_FFN` (all layers)
- Mechanism: Gate network → Sigmoid → Scale residual contribution → Budget loss
- Based on: Mixture of Depths (Raposo et al. 2024), GateSkip (2025), Router-Tuning (2024)
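A minimal sketch of the gate-plus-budget mechanism (hypothetical names, not the Cortex implementation): the gate interpolates per token between the block's output and a pure skip, and the budget loss pulls mean gate usage toward a target compute fraction.

```python
import torch
import torch.nn as nn

class TinyDepthGate(nn.Module):
    """Illustrative sketch: per-token gate over a block's residual
    contribution, plus a budget regularizer."""
    def __init__(self, hidden_dim: int, target_budget: float = 0.7):
        super().__init__()
        self.gate = nn.Linear(hidden_dim, 1)
        self.target = target_budget
        self.last_gates = None

    def forward(self, h_in: torch.Tensor, h_out: torch.Tensor) -> torch.Tensor:
        g = torch.sigmoid(self.gate(h_in))   # (B, T, 1): execute vs. skip
        self.last_gates = g
        # g -> 0 skips the block for that token; g -> 1 executes it fully
        return h_in + g * (h_out - h_in)

    def budget_loss(self) -> torch.Tensor:
        # Penalize deviation of mean gate activity from the compute target
        return (self.last_gates.mean() - self.target) ** 2

gate = TinyDepthGate(16, target_budget=0.7)
h_in, h_out = torch.randn(2, 5, 16), torch.randn(2, 5, 16)
mixed = gate(h_in, h_out)
loss = gate.budget_loss()  # add to the task loss during training
```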
## Quick Start

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from cortex import (
    CortexSurgeon, MemoryBank, HallucinationGate,
    PauseAndThink, BacktrackHead, SteeringVector, AdaptiveDepth,
)

# Load any pretrained LLM
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B")

# Create the surgeon
surgeon = CortexSurgeon(model)
hidden_dim = surgeon.hidden_dim

# Add modules; each targets specific layers
surgeon.add_module("memory", MemoryBank(hidden_dim=hidden_dim, num_slots=64))
surgeon.add_module("halluc_gate", HallucinationGate(hidden_dim=hidden_dim))
surgeon.add_module("pause_think", PauseAndThink(hidden_dim=hidden_dim, num_think_tokens=8))
surgeon.add_module("backtrack", BacktrackHead(hidden_dim=hidden_dim, num_layers=surgeon.num_layers))
surgeon.add_module("steering", SteeringVector(hidden_dim=hidden_dim, num_directions=4))
surgeon.add_module("adaptive_depth", AdaptiveDepth(hidden_dim=hidden_dim))

# Perform surgery (freezes the base model; only Cortex modules train)
surgeon.operate(freeze_base=True)

# Use the model normally; Cortex modules are active
inputs = tokenizer("The meaning of life is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0]))

# Toggle modules on/off at runtime
surgeon.modules["halluc_gate"].disable()

# Save only the Cortex weights (~3% of model size)
surgeon.save_cortex_modules("cortex_weights.pt")
```
## Benchmark Harness

Cortex includes a comprehensive benchmark harness for comparing base LLMs against their Cortex-enhanced versions. It evaluates across standard NLP benchmarks and Cortex-specific capability tests.
### Standard Benchmarks

| Task | Type | Choices | Dataset | Few-Shot |
|---|---|---|---|---|
| HellaSwag | Commonsense NLI | 4 | `Rowan/hellaswag` | 5-shot |
| ARC-Easy | Science QA | 3-5 | `allenai/ai2_arc` | 5-shot |
| ARC-Challenge | Science QA (hard) | 3-5 | `allenai/ai2_arc` | 5-shot |
| PIQA | Physical intuition | 2 | `gimmaru/piqa` | 0-shot |
| WinoGrande | Coreference | 2 | `allenai/winogrande` | 5-shot |
| MMLU | Multi-domain knowledge | 4 | `cais/mmlu` | 5-shot |
| HaluEval | Hallucination detection | 2 | `pminervini/HaluEval` | 0-shot |
### Cortex-Specific Benchmarks
| Task | Tests | Method |
|---|---|---|
| Passkey Retrieval | Long-context memory, attention to details | Generation + substring match at 128/256/512/1024 token contexts |
| Multi-Hop Memory | Compositional reasoning, fact chaining | Generation + answer extraction from 3-hop fact chains |
### Running Benchmarks

```shell
# Quick test (10 examples per task)
python -m benchmark.run_benchmark --n 10 --tasks hellaswag piqa

# Standard suite (50 examples, default tasks)
python -m benchmark.run_benchmark --n 50

# Full evaluation with all tasks
python -m benchmark.run_benchmark --n 0 --tasks hellaswag piqa arc-easy arc-challenge winogrande mmlu

# Custom model
python -m benchmark.run_benchmark --model meta-llama/Llama-3.2-1B --n 50

# Save JSON results
python -m benchmark.run_benchmark --n 50 --output results.json

# Skip memory benchmarks
python -m benchmark.run_benchmark --n 50 --no-memory

# Custom passkey test
python -m benchmark.run_benchmark --n 20 --passkey-lengths 128 256 512 1024 --n-passkey 10

# Options combined
python -m benchmark.run_benchmark --n 10 --model meta-llama/Llama-3.2-1B --tasks hellaswag piqa arc-easy arc-challenge winogrande mmlu
```
### Scoring Method

- Multiple-choice tasks: log-likelihood scoring. The harness computes the average log-probability the model assigns to each continuation and picks the highest. This is the standard approach used by lm-evaluation-harness and the Open LLM Leaderboard.
- Generation tasks: greedy decode + substring match against the expected answer.
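The log-likelihood scoring rule for multiple-choice tasks can be sketched in a few lines (illustrative only; the toy logits and function name are made up, not the harness code):

```python
import torch
import torch.nn.functional as F

def score_continuation(logits: torch.Tensor, continuation_ids: torch.Tensor) -> float:
    """Average log-probability a model assigns to a continuation.
    `logits` are aligned so that logits[t] predicts continuation_ids[t]."""
    logprobs = F.log_softmax(logits, dim=-1)
    token_lp = logprobs.gather(-1, continuation_ids.unsqueeze(-1)).squeeze(-1)
    return token_lp.mean().item()

# Toy example: vocab of 4, a 2-token continuation per choice.
# Position 0 strongly predicts token 0; position 1 strongly predicts token 1.
logits = torch.tensor([[4.0, 0.0, 0.0, 0.0],
                       [0.0, 4.0, 0.0, 0.0]])
good = score_continuation(logits, torch.tensor([0, 1]))  # high-probability tokens
bad = score_continuation(logits, torch.tensor([2, 3]))   # low-probability tokens
assert good > bad  # the benchmark picks the highest-scoring choice
```

Averaging (rather than summing) log-probabilities avoids penalizing longer answer choices.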
### Example Output (Llama-3.2-1B, n=10)

```
======================================================================
BENCHMARK SUMMARY: meta-llama/Llama-3.2-1B
n=10 per task, device=mps
======================================================================
Task                 Base     Cortex   Delta
--------------------------------------------------
hellaswag            0.6000   0.6000   +0.0000
piqa                 0.2000   0.2000   +0.0000
arc-easy             0.4000   0.4000   +0.0000
arc-challenge        0.5000   0.5000   +0.0000
winogrande           0.6000   0.6000   +0.0000
mmlu                 0.4000   0.4000   +0.0000
passkey              1.0000   1.0000   +0.0000
multi_hop            1.0000   1.0000   +0.0000

Cortex overhead: 53,708,968 params (4.35%)
======================================================================
```
Note: Cortex modules are untrained at injection and initialize as exact no-ops for model behavior. Freshly injected modules should match the base model; positive deltas require Cortex-specific training or calibrated steering directions.
### Programmatic Usage

```python
from benchmark.runner import BenchmarkRunner

runner = BenchmarkRunner(model_name="HuggingFaceTB/SmolLM2-135M")
results = runner.run_comparison(
    tasks=["hellaswag", "piqa", "arc-easy"],
    n=50,
    include_memory=True,
    passkey_lengths=[128, 256, 512],
)
BenchmarkRunner.print_summary(results)
```
## Design Principles

### 1. Zero-Init for Stable Injection

All modules initialize their output projections to zero (or near-zero via negative gate biases). At injection time the model therefore behaves identically to the original: Cortex modules are "invisible" and gradually learn to contribute during training.
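A tiny demonstration of why zero-init makes injection safe (a generic residual adapter, not a Cortex module): the output is bit-identical to the input, yet gradients still flow through the adapter so it can learn to contribute.

```python
import torch
import torch.nn as nn

# A residual adapter whose final projection is zero-initialized is an
# exact no-op at injection time.
adapter = nn.Sequential(nn.Linear(16, 32), nn.GELU(), nn.Linear(32, 16))
nn.init.zeros_(adapter[-1].weight)
nn.init.zeros_(adapter[-1].bias)

h = torch.randn(2, 5, 16)
out = h + adapter(h)
assert torch.equal(out, h)  # identical behavior to the base model

# Gradients still reach the zeroed projection, so training can move it.
out.sum().backward()
assert adapter[-1].weight.grad is not None
```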
### 2. Hook-Based Surgery

Modules are injected via PyTorch `register_forward_hook` / `register_forward_pre_hook`. No model code is modified. This works with any HuggingFace `transformers` model that has a standard layer structure.
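A minimal sketch of the hook mechanism (a stand-in linear "layer" and a hypothetical adapter, not the CortexSurgeon internals): a forward hook intercepts a block's output, applies a residual edit, and can be removed to undo the surgery.

```python
import torch
import torch.nn as nn

block = nn.Linear(8, 8)    # stand-in for one transformer layer
adapter = nn.Linear(8, 8)  # zero-init so the edit starts as a no-op
nn.init.zeros_(adapter.weight)
nn.init.zeros_(adapter.bias)

def post_ffn_hook(module, inputs, output):
    # Returning a value from a forward hook replaces the module's output:
    # this is a POST_FFN-style residual edit.
    return output + adapter(output)

handle = block.register_forward_hook(post_ffn_hook)
x = torch.randn(2, 8)
y = block(x)      # hook is active, but zero-init keeps it a no-op
handle.remove()   # surgery is fully reversible: no model code was modified
assert torch.equal(y, block(x))
```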
### 3. Shared Parameters Across Layers

Each module instance is shared across its target layers. A single MemoryBank object handles all middle layers, keeping the parameter count low.
### 4. Base Model Freezing

By default, all base model parameters are frozen. Only Cortex module parameters are trainable. This means:
- No catastrophic forgetting of the base model's capabilities
- Tiny training cost (~3% of parameters)
- Multiple Cortex configurations can be saved/loaded/swapped
### 5. Composability

All modules are independent and composable. Use any combination:
- Memory + HallucinationGate for factual QA
- PauseAndThink + AdaptiveDepth for reasoning tasks
- SteeringVector alone for behavioral control
## Injection Points

| Point | Location | Best For |
|---|---|---|
| `PRE_ATTENTION` | Before self-attention | Input preprocessing, prefix injection |
| `POST_ATTENTION` | After attention, before FFN | Memory augmentation (reads enhance attention output) |
| `PRE_FFN` | Before FFN | Gate what the FFN processes |
| `POST_FFN` | After full block | Gating, confidence estimation |
| `RESIDUAL_STREAM` | Wraps entire block | Steering vectors, thinking tokens, backtracking |
## Layer Targeting

```python
# Target specific layers
surgeon.add_module("mod", module, target_layers=[0, 1, 2, 3])

# Target all layers
surgeon.add_module("mod", module, target_layers="all")

# Target the middle third (best for steering/memory)
surgeon.add_module("mod", module, target_layers="middle")

# Target deep layers (best for output-facing modifications)
surgeon.add_module("mod", module, target_layers="deep")
```
## Compatible Models

Tested and working with any model using the standard `model.layers[i]` structure:

- LLaMA family (LLaMA 2/3, CodeLLaMA)
- Mistral / Mixtral
- Qwen2
- Gemma
- Phi
- SmolLM
- GPT-2 / GPT-Neo (uses `transformer.h`)
## Monitoring

```python
# Hallucination confidence
confidence = surgeon.modules["halluc_gate"].get_confidence()

# Backtracking status
was_triggered = surgeon.modules["backtrack"].was_triggered()
confidence_per_layer = surgeon.modules["backtrack"].get_confidence_history()

# Adaptive depth statistics
gate_stats = surgeon.modules["adaptive_depth"].get_gate_stats()
print(f"Mean gate: {gate_stats['mean']:.3f}, Skip fraction: {gate_stats['skip_frac']:.3f}")

# Steering vector info
for name, info in surgeon.modules["steering"].get_direction_info().items():
    print(f"{name}: alpha={info['alpha']:.3f}")

# Parameter report
report = surgeon.get_parameter_report()
```
## Extracting Steering Directions (RepE)

```python
# Contrastive activation pairs for "truthfulness"
positive = [
    "I know for certain that the Earth orbits the Sun.",
    "Scientific evidence clearly shows vaccines are safe.",
    "I don't know the answer to that question.",
]
negative = [
    "The Earth is flat and NASA is lying.",
    "Vaccines cause autism according to my research.",
    "I'm absolutely certain about this made-up fact.",
]

# Extract a direction from layer 15
direction = SteeringVector.extract_direction(
    model, positive, negative, tokenizer,
    layer_idx=15, device="cuda",
)

# Set it in the steering module
surgeon.modules["steering"].set_direction("truthfulness", direction, alpha=10.0)
```
## Training

For benchmark-style supervised tuning, use the training CLI. It freezes the base model, injects Cortex modules, optimizes only the Cortex parameters, and saves the adapter weights:

```shell
# Train Cortex modules on benchmark tasks
python -m benchmark.train_cortex \
    --model meta-llama/Llama-3.2-1B \
    --tasks hellaswag piqa arc-easy winogrande \
    --n-train 32 \
    --epochs 1 \
    --output cortex_tuned.pt

# Evaluate the tuned weights
python -m benchmark.run_benchmark \
    --model meta-llama/Llama-3.2-1B \
    --cortex-weights cortex_tuned.pt \
    --n 50
```
For custom training loops:

```python
import torch.optim as optim

# Only Cortex parameters are trainable
optimizer = optim.AdamW(surgeon.get_trainable_parameters(), lr=1e-4)

for batch in dataloader:
    optimizer.zero_grad()
    outputs = model(**batch)
    loss = outputs.loss
    # Add the adaptive-depth budget loss to target a compute fraction
    loss = loss + surgeon.modules["adaptive_depth"].get_budget_loss()
    loss.backward()
    optimizer.step()
```
## Test Results (SmolLM2-135M)

- ✓ All 9 tests passed
- ✓ 4,286,918 Cortex params (3.19% overhead on a 134M model)
- ✓ Base model: 0 gradients (fully frozen)
- ✓ Cortex modules: gradients flowing
- ✓ Enable/disable: exact zero diff when disabled
- ✓ Generation: produces coherent output
- ✓ Save/load: 16.4 KB checkpoint
## Citation

If you use Cortex in your research, please cite the papers that inspired each module:

```bibtex
@article{kang2025lm2,
  title={LM2: Large Memory Models},
  author={Kang et al.},
  journal={arXiv:2502.06049},
  year={2025}
}

@article{chen2024internalinspector,
  title={InternalInspector I2: Robust Confidence Estimation in LLMs through Internal States},
  author={Chen et al.},
  journal={arXiv:2406.12053},
  year={2024}
}

@article{goyal2023think,
  title={Think before you speak: Training Language Models With Pause Tokens},
  author={Goyal et al.},
  journal={arXiv:2310.02226},
  year={2023}
}

@article{zou2023representation,
  title={Representation Engineering: A Top-Down Approach to AI Transparency},
  author={Zou et al.},
  journal={arXiv:2310.01405},
  year={2023}
}

@article{raposo2024mixture,
  title={Mixture of Depths},
  author={Raposo et al.},
  journal={arXiv:2404.02258},
  year={2024}
}
```
## License

Apache 2.0