Chimère: A Self-Improving MoE Inference System for Consumer Hardware
35B parameters. 80 tokens/second on the production HTTP path. One GPU. $0.10/day. The model improves while you sleep.
🆕 Latest update — April 2026: Step 7 multi-architecture dispatch. The same
`chimere-server` runtime now also runs Mamba-2 / Nemotron-H MoE hybrid SSM models end-to-end, on top of a custom backport of upstream llama.cpp's Mamba-2 work into our `ik_llama.cpp` fork (offered upstream as PR #1593). NVIDIA Nemotron-3-Nano-30B-A3B Q4_0 measured at ~45 tok/s on RTX 5060 Ti (sm_120, NCMOE=30, ctx 2048). The Qwen3.5 production path is byte-for-byte unchanged. See Multi-architecture support below.
What is Chimère?
Chimère is a complete inference system — not just a model or a runtime, but an integrated stack where every component feeds the others. It runs Qwen3.5-35B-A3B (35B total parameters, 3.5B active per token, 256 experts) on a single RTX 5060 Ti (16 GB VRAM) at **80 tok/s on the chimere-server HTTP production path** (the bare ik_llama backend reaches ~93 tok/s; the Rust HTTP / sampling layer adds the difference). A nightly quality loop further improves the system from production traffic.
This is the kind of system NVIDIA builds for enterprise deployments, except it runs on a desktop in the south of France.
User request
│
▼
┌─────────────────────────────────────────────────────────┐
│ ODO — Unified Orchestrator (port 8084) │
│ ┌─────────────┐ ┌──────────────┐ ┌───────────────┐ │
│ │ Intent │ │ Entropy │ │ Confidence │ │
│ │ Classifier │ │ Router │ │ RAG Trigger │ │
│ │ (3-cascade) │ │ (fast/qual/ │ │ (logprob │ │
│ │ │ │ ultra) │ │ probe) │ │
│ └──────┬──────┘ └──────┬───────┘ └──────┬────────┘ │
│ │ │ │ │
│ ┌──────▼──────────────────────────────────▼────────┐ │
│ │ Enrichment Pipeline │ │
│ │ • Web Search SOTA (8-stage: expand→search→RRF→ │ │
│ │ fetch→chunk→rerank→CRAG→synthesize) │ │
│ │ • ChromaDB RAG (dense + BM25 + RRF + cross-enc) │ │
│ │ • FAISS Semantic Few-shot (per domain) │ │
│ │ • Dynamic Engram (web → n-gram logit bias) │ │
│ │ • Tool Injection (auto, from pipeline YAML) │ │
│ └──────┬────────────────────────────────────────────┘ │
│ │ │
│ ┌──────▼────────┐ ┌────────────────┐ │
│ │ DVTS Tree │ │ ABF + CGRS │ │
│ │ Search (K=2, │ │ (thinking │ │
│ │ ThinkPRM) │ │ budget mgmt) │ │
│ └──────┬────────┘ └──────┬─────────┘ │
└─────────┼──────────────────┼────────────────────────────┘
│ │
▼ ▼
┌─────────────────────────────────────────────────────────┐
│ chimere-server (port 8081) — Rust Runtime │
│ • ik_llama FFI backend (93 tok/s) │
│ • Multi-tier Engram (Cuckoo <10ns → hash O(1) → FAISS)│
│ • Logprobs (top-5 log-softmax, real values) │
│ • ABF token 248069 forcing at budget threshold │
│ • IQ3_S custom-mix / RAMP-v2 (15.2 GB, 3.78 BPW) │
│ • KV cache: q8_0 keys + q4_0 values (sweet spot) │
└──────────────────────┬──────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────┐
│ Quality & Self-Improvement Loop │
│ • ThinkPRM-1.5B (CPU): step-level verification │
│ • quality_scores.jsonl (104 scores, mean 3.04/5) │
│ • training_pairs.jsonl (68 pairs, score ≥ 4) │
│ • 03:00 — Nightly LoRA (MeZO, stops GGUF→train→restart)│
│ • 04:00 — Engram WRITE (quality-gated, decay >30d) │
│ • Mon 02:00 — DSPy MIPROv2 prompt optimization │
│ • 6h — ChromaDB RAG reindex │
└─────────────────────────────────────────────────────────┘
Why This Is SOTA for Consumer Hardware (March 2026)
vs. Existing Solutions
| System | What it does | What Chimère adds |
|---|---|---|
| llama.cpp / ik_llama | Serve GGUF models | We use ik_llama as backend (+23% vs stock), add Engram, ABF, quality loop |
| KTransformers | CPU/GPU co-serving for MoE | We go further: per-tensor mixed-precision (RAMP), self-improving quality |
| Ollama / LM Studio | Easy local LLM UI | No orchestration, no quality loop, no domain memory, no nightly improvement |
| vLLM / TensorRT-LLM | High-throughput serving | Requires A100+, no consumer GPU support for 35B MoE |
| OpenRouter / Together AI | API access to MoE models | Cloud, not local. $0.60/M tokens vs $0.10/day |
| Autoresearch (Karpathy) | Self-improving research agent | Concept paper, not deployed system. No runtime, no quantization |
| DeepSeek Engram | Conditional memory for MoE | Paper only. We implemented multi-tier version with quality-gated write |
What No One Else Has Combined
- Rust runtime + custom CUDA sm_120 kernels for Blackwell consumer GPUs — 56K lines, the only MoE runtime in Rust
- RAMP data-free quantization — per-tensor mixed-precision without calibration data, 15.2 GB GGUF
- Multi-tier Engram — Cuckoo filter (<10ns) → N-gram hash (O(1)) → FAISS semantic (~5ms), with quality-gated nightly writes
- Adaptive Budget Forcing — thinking budget management for quantized reasoning (IQ3_S produces less coherent thinking than BF16, so budget must be shorter)
- The quality loop — ThinkPRM scores every response, high-quality pairs feed nightly LoRA + Engram + DSPy. The system improves from production traffic.
- Honest negative results — we tried and documented why speculative decoding (DFlash τ=6.06, wall-clock 0.73×), MTP (84.8% acceptance, 0.51×), and expert prefetch (86.65% hit, +1.1%) don't help on this hardware.
The Components
1. ODO — Unified Orchestrator
chimere-odo | 17K lines Python
A single proxy between the user and the model that adds intelligence:
- Intent Classification: 3-strategy cascade (regex 99% <1ms → filetype → LLM GBNF <50ms). Routes to: code, kine, cyber, research, default.
- Entropy Router: Measures query complexity → fast (no-think, 0.7 temp), quality (think, 2048 budget, ABF 0.55), ultra (DVTS K=2, ThinkPRM).
- Enrichment: Web search (8-stage SOTA pipeline), ChromaDB RAG (dense+BM25+RRF+cross-encoder), semantic few-shot, dynamic Engram.
- Quality Gate: ThinkPRM-1.5B verifies step-level reasoning. Score ≤ 2 → retry with reflection.
- Pipeline YAML: Define multi-step agent workflows with hot-reload. 5 pipelines shipped (code: architect→coder, cyber: triage→correlate→remediate).
Why not LangChain/LlamaIndex? Too heavy, too many abstractions, designed for cloud APIs. ODO's core proxy is 1,525 lines doing exactly what we need, with zero external framework dependencies.
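The 3-strategy cascade above can be sketched in a few lines. This is an illustrative Python sketch, not ODO's actual code: the rule sets, function names, and domain keywords are hypothetical; only the cascade order (regex → filetype → constrained LLM fallback) and the domain labels come from the README.

```python
import re

# Hypothetical sketch of ODO's 3-strategy intent cascade: cheap regex rules
# first, then attachment filetype hints, then an LLM fallback. Rule contents
# here are made up; the cascade order and domains follow the README.

REGEX_RULES = [
    (re.compile(r"\b(exploit|CVE-\d{4}-\d+|nmap)\b", re.I), "cyber"),
    (re.compile(r"```|\bdef\b|\bfn\b|\bimport\b"), "code"),
]
FILETYPE_RULES = {".py": "code", ".rs": "code", ".pcap": "cyber"}

def classify(query: str, attached_files=(), llm_fallback=None) -> str:
    # Strategy 1: regex (fast path; per the README it handles ~99% in <1ms)
    for pattern, domain in REGEX_RULES:
        if pattern.search(query):
            return domain
    # Strategy 2: file extension of any attachment
    for name in attached_files:
        for ext, domain in FILETYPE_RULES.items():
            if name.endswith(ext):
                return domain
    # Strategy 3: constrained LLM call (e.g. a GBNF grammar that limits the
    # output to the known domain labels); stubbed out here
    if llm_fallback is not None:
        return llm_fallback(query)
    return "default"
```

The point of the ordering is cost: each strategy is roughly an order of magnitude slower than the previous one, so the LLM is only consulted for the residue the cheap rules cannot decide.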
2. Engram — Multi-Tier Domain Memory
Inspired by DeepSeek Engram but implemented as a lightweight lookup system:
- Tier 0 — Cuckoo filter (<10ns): skips 97% of lookups for tokens not in any table
- Tier 1 — FNV-1a hash tables (O(1)): core N-gram matching, binary format compatible with Rust and Python
- Tier 2 — FAISS semantic (~5ms): embedding-based few-shot example retrieval
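The tier ordering can be sketched as a cheap-to-expensive cascade. This is an illustrative sketch with hypothetical names: a Python set stands in for the Cuckoo filter, a dict for the FNV-1a hash tables, and a stubbed callable for the FAISS tier.

```python
# Three-tier lookup sketch: Tier 0 rejects misses before any table access,
# Tier 1 resolves exact n-gram hits, Tier 2 is only consulted for semantic
# few-shot retrieval. Data structures are simplified stand-ins.

class EngramLookup:
    def __init__(self, ngram_table, semantic_search=None):
        self.ngram_table = ngram_table          # Tier 1: n-gram -> {token_id: weight}
        self.prefilter = set(ngram_table)       # Tier 0: membership test (Cuckoo stand-in)
        self.semantic_search = semantic_search  # Tier 2: expensive, optional

    def logit_bias(self, context_ngram):
        # Tier 0: O(1) membership check; per the README this skips ~97% of lookups
        if context_ngram not in self.prefilter:
            return {}
        # Tier 1: exact n-gram hit in the hash table
        return self.ngram_table.get(context_ngram, {})

    def few_shot(self, query_embedding, k=3):
        # Tier 2: semantic retrieval, only when enrichment asks for it
        if self.semantic_search is None:
            return []
        return self.semantic_search(query_embedding, k)

table = {("genou", "valgus"): {4821: 0.1}}   # toy domain table
engram = EngramLookup(table)
```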
Ablation results (measured on 10-question benchmark):
Engram v1 (α=0.35, think+response): 77% ← biases thinking, DEGRADES
Engram OFF (α=0): 85% ← baseline
Engram v2 (α=0.1, response-only): 88% ← PRODUCTION
Key insight: applying Engram bias during the thinking phase constrains reasoning with domain patterns. Response-only bias with low α is the sweet spot.
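The v2 policy is small enough to state as code. A minimal sketch, assuming a dict of logits and an explicit phase flag (both illustrative; the real runtime applies this inside the Rust sampler):

```python
# v2 Engram bias: applied only during the response phase, scaled by a small
# alpha. v1 (alpha=0.35, applied during thinking too) degraded accuracy, so
# the thinking phase is left untouched. Names and shapes are illustrative.

ALPHA = 0.1  # production value per the ablation above

def apply_engram_bias(logits, bias_table, in_thinking_phase):
    if in_thinking_phase:
        return logits                      # never bias the reasoning tokens
    out = dict(logits)
    for token_id, weight in bias_table.items():
        out[token_id] = out.get(token_id, 0.0) + ALPHA * weight
    return out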
Why not RAG alone? RAG injects knowledge via context (expensive, limited by context window). Engram injects at the logit level (zero context cost, unlimited knowledge). They're complementary — RAG for long-form retrieval, Engram for factual bias.
3. RAMP — Data-Free Quantization Pipeline
ramp-quant | 9K lines Python + C
Produces hardware-optimized GGUF without calibration data:
- NSDS sensitivity — kurtosis + SVD rank per tensor, data-free
- Proxy model — round-trip quantization error × sensitivity = instant loss estimate
- Evolutionary search — 128 population, 200 generations, under VRAM budget
- Build — generates the `llama-quantize --custom-q` command
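The evolutionary search stage can be sketched as follows. This is a toy illustration, not RAMP's implementation: the quant table, error numbers, and population/generation sizes are made up, and the fitness is a stand-in for the proxy-model loss estimate.

```python
import random

# Toy evolutionary search over per-tensor quant assignments: fitness is
# proxy loss (quantization error x sensitivity) with a hard penalty for
# exceeding the size budget. All numbers here are illustrative.

QUANTS = {"Q8_0": (8.5, 0.01), "Q5_K": (5.5, 0.05), "IQ3_S": (3.4, 0.20)}  # (bpw, error)

def proxy_loss(genome, sensitivity):
    return sum(QUANTS[q][1] * s for q, s in zip(genome, sensitivity))

def size_bpw(genome):
    return sum(QUANTS[q][0] for q in genome) / len(genome)

def evolve(sensitivity, budget_bpw, pop=32, gens=100, seed=0):
    rng = random.Random(seed)
    names = list(QUANTS)
    def random_genome():
        return [rng.choice(names) for _ in sensitivity]
    def fitness(g):  # penalize genomes over the size budget
        penalty = 1e6 if size_bpw(g) > budget_bpw else 0.0
        return proxy_loss(g, sensitivity) + penalty
    # seed with the smallest mix so a feasible genome always exists
    population = [["IQ3_S"] * len(sensitivity)] + [random_genome() for _ in range(pop - 1)]
    for _ in range(gens):
        population.sort(key=fitness)
        survivors = population[: pop // 2]          # elitism: keep the best half
        children = []
        for g in survivors:
            child = list(g)
            child[rng.randrange(len(child))] = rng.choice(names)  # point mutation
            children.append(child)
        population = survivors + children
    return min(population, key=fitness)

best = evolve(sensitivity=[1.0, 0.8, 0.3, 0.05], budget_bpw=5.0)
```

The search naturally pushes high-precision formats onto high-sensitivity tensors and spends the cheap IQ3_S budget on the insensitive ones, which is exactly the hierarchy the next section reports.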
GDN sensitivity hierarchy discovered (most to least sensitive):
| Tensor group | Quantization |
|---|---|
| SSM gates (α, β) | Q8_0 |
| Attention Q/K | Q5_K / Q6_K |
| Shared experts | Q5_K |
| Routed experts | IQ3_S |
What we tried that failed:
| Method | Result | Why |
|---|---|---|
| QuaRot (Hadamard rotation) | PPL = 49,524 | Incompatible with GDN recurrent state |
| OptRot (Givens rotation) | PPL = 49,524 | No cross-layer absorption |
| ParoQuant (pairwise rotation) | OOM | 256 experts × grouped_mm > 16 GB |
| EvoPress (KL fitness) | OOM | Full model needed in RAM (70 GB) |
| GPTQ/AWQ layer-wise | Complex | MoE expert handling not supported |
4. chimere-server — Rust Inference Runtime
chimere | 56K lines Rust + 2.6K CUDA
The only MoE inference runtime written in Rust, with:
- Custom CUDA kernels for sm_120 (IQ3_S dequant, Q8_0+dp4a GEMV, flash attention, fused MoE)
- Three backends: libllama FFI (93 tok/s), cudarc (57 tok/s), Candle (18 tok/s)
- GDN state save/restore (impossible in llama.cpp — this enables speculative decoding on hybrid architectures)
- OpenAI-compatible `/v1/chat/completions` API with streaming logprobs
Performance journey (5 days, March 14-19):
9.1 → 21 → 30.5 → 42.5 → 57 → 93 tok/s
Through 9 Candle optimizations, Q8_1+dp4a kernels, fused operations, and finally the libllama FFI breakthrough.
5. Self-Improvement Loop
The system improves while idle:
| Timer | What | How |
|---|---|---|
| 03:00 daily | LoRA training | MeZO zeroth-order on quality-filtered pairs (score ≥ 4). Stops GGUF server → trains → restarts (try/finally). |
| 04:00 daily | Engram WRITE | Add validated responses to domain tables. Decay: halve weight >30d, delete >90d. |
| Mon 02:00 | DSPy MIPROv2 | Bayesian prompt optimization per domain. Tested: code +8% on benchmark. |
| Every 6h | RAG reindex | ChromaDB re-ingestion of knowledge base. |
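The 04:00 Engram decay rule from the table can be sketched directly. A minimal sketch, assuming a `{ngram: (weight, written_at)}` entry format (hypothetical; the real tables are the binary format mentioned above) and reading "halve >30d" as a single halving:

```python
from datetime import datetime, timedelta

# Engram WRITE decay policy: entries older than 30 days get their weight
# halved, entries older than 90 days are dropped entirely.

def decay_entries(entries, now=None):
    now = now or datetime.now()
    kept = {}
    for ngram, (weight, written_at) in entries.items():
        age = now - written_at
        if age > timedelta(days=90):
            continue                 # delete: too stale to trust
        if age > timedelta(days=30):
            weight = weight / 2      # halve: old but still useful
        kept[ngram] = (weight, written_at)
    return kept
```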
Quality scoring: ThinkPRM-1.5B runs on CPU, scoring every response 1-5 with step-level chain-of-thought verification. 104 scores logged (mean 3.04/5), 68 training pairs generated, 72 SPIN DPO pairs accumulated.
Nightly scoring also uses a 9B model on CPU (qwen9b-scorer, port 8085), running alongside the production 35B with zero VRAM impact. The previous 27B scorer required stopping production; the 9B CPU scorer eliminated all nightly downtime.
Why not RLHF/DPO on cloud? We can't afford cloud GPU time. MeZO trains at inference cost — the script stops the GGUF server, trains for ~1 min, restarts. Quality is lower than full DPO but it's free and runs every night.
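MeZO's trick is that it needs only two forward passes per step, no backprop state, which is why it can train at inference cost. Here is a self-contained sketch of a MeZO-style (SPSA) step on a toy quadratic loss; dimensions, learning rate, and the loss are made up for illustration, and the real run perturbs LoRA weights inside the stopped-server window.

```python
import random

# One MeZO-style zeroth-order step: perturb parameters by +eps*z and -eps*z,
# take two forward passes, and use the loss difference as a projected
# gradient estimate. No backward pass is ever computed.

def mezo_step(theta, loss_fn, lr=0.01, eps=1e-3, seed=0):
    rng = random.Random(seed)            # seed lets z be regenerated, not stored
    z = [rng.gauss(0, 1) for _ in theta]
    plus = [t + eps * zi for t, zi in zip(theta, z)]
    minus = [t - eps * zi for t, zi in zip(theta, z)]
    g = (loss_fn(plus) - loss_fn(minus)) / (2 * eps)   # scalar estimate
    return [t - lr * g * zi for t, zi in zip(theta, z)]

loss = lambda th: sum(t * t for t in th)  # toy loss with minimum at 0
theta = [1.0, -2.0, 0.5]
for step in range(500):
    theta = mezo_step(theta, loss, seed=step)  # fresh perturbation each step
```

The seed-based regeneration of `z` is what keeps memory at inference level: the perturbation never needs to be materialized alongside optimizer state.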
6. DFlash — Block Diffusion Drafter
8 architectures over 27 days. Best holdout result: τ=6.06 (comparable to original DFlash paper). But wall-clock = 0.73× (slowdown) because the target model is too fast (93 tok/s) for speculative decoding to help.
The GDN State Barrier: GDN recurrent layers cannot be rolled back after draft rejection. This is a structural incompatibility affecting all hybrid SSM-attention models (Jamba, RWKV, Qwen3.5). No amount of drafter improvement fixes this — it requires a new runtime (chimere-server provides one).
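A back-of-envelope cost model shows why a good τ can still lose wall-clock against a fast target. The formula and the draft-cost numbers below are illustrative assumptions, not measurements from the DFlash runs; only τ=6.06 and the ~10.8 ms/step implied by 93 tok/s come from the text.

```python
# Each speculation round pays draft_len draft steps plus one target
# verification, and yields tau accepted tokens on average. The faster the
# target, the less each saved target step is worth.

def spec_decode_speedup(tau, t_target_ms, t_draft_ms, draft_len):
    baseline = tau * t_target_ms                       # tau plain target steps
    speculative = draft_len * t_draft_ms + t_target_ms # one round's cost
    return baseline / speculative

# Illustrative drafter cost of 9 ms/step (block-diffusion drafting is not
# cheap per step). Same drafter, two different targets:
fast_target = spec_decode_speedup(tau=6.06, t_target_ms=10.8, t_draft_ms=9.0, draft_len=8)
slow_target = spec_decode_speedup(tau=6.06, t_target_ms=50.0, t_draft_ms=9.0, draft_len=8)
```

With these assumed numbers the 10.8 ms target loses wall-clock while a 50 ms target would gain more than 2x, which matches the qualitative conclusion: the 93 tok/s target is simply too fast for this drafter to pay for itself.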
7. MTP — Multi-Token Prediction (Negative Result)
First MTP implementation for Qwen3.5 MoE in ik_llama.cpp (5 patches, 8 bugs fixed). 84.8% acceptance but 0.51× speedup — the MTP layer is itself MoE (256 experts on CPU), costing as much as a main forward.
8. Expert Prefetch (Negative Result)
An MLP predictor achieves 86.65% hit@8 accuracy, yet end-to-end throughput improves by only +1.1% (effectively zero): ggml's multi-threaded CPU loop makes GPU prefetch serialize what was parallel work.
Quantified Results
| Metric | Value |
|---|---|
| Generation throughput (Qwen3.5-35B-A3B prod path) | 80 tok/s chimere-server HTTP, ~93 tok/s bare ik_llama backend |
| Generation throughput (Nemotron-3-Nano-30B-A3B, NEW) | ~45 tok/s chimere-server HTTP, NCMOE=30, ctx 2048 |
| Model | Qwen3.5-35B-A3B, RAMP-v2 15.2 GB (3.78 BPW) |
| VRAM usage | ~14 GB / 16 GB |
| Benchmark | 10/10 (code, math, tools, domain) |
| Engram ablation | v1: 77% → OFF: 85% → v2: 88% |
| Quality scores | 104 entries, mean 3.04/5 |
| DFlash τ (holdout) | 6.06 (comparable to paper's 6.4) |
| DFlash wall-clock | 0.73× (negative — honest result) |
| MTP acceptance | 84.8% (but 0.51× throughput) |
| Expert prefetch | 86.65% hit@8 (but +1.1% throughput) |
| Code size | 121K lines (Rust + Python + CUDA + C) |
| Cost | $0.10/day electricity |
| Hardware | RTX 5060 Ti 16GB, i5-14600KF, 32GB DDR5 |
Multi-architecture support
As of April 2026 (Step 7 of the chimere-server multi-arch refactor), the same `chimere-server` runtime dispatches between two code paths based on the GGUF's `general.architecture` metadata:
| Path | Architectures | Features |
|---|---|---|
| Qwen3.5 (prod) | `qwen35moe` | Full stack: MTP, MRoPE, Engram, multi-agent, cudarc / Candle / libllama backends, fast C++ sampler |
| Generic (libllama) | `mamba2`, `nemotron_h_moe`, `mamba` | libllama-only: forward via `LlamaForward` FFI, no MTP, no Engram, single-agent at Step 7 |
The Generic path was unblocked by a 12-commit Phase 3.x backport of upstream llama.cpp's Mamba-2 / Nemotron-H MoE support into our ik_llama.cpp fork, offered upstream as PR #1593. Validated end-to-end on:
- `unsloth/Nemotron-3-Nano-30B-A3B-GGUF` Q4_0: 45 tok/s on RTX 5060 Ti, NCMOE=30, ctx 2048, via `bin/test-nemotron` and through HTTP `/v1/chat/completions`
- `unsloth/Nemotron-3-Nano-30B-A3B-GGUF` UD-IQ3_XXS: same path, coherent text on CPU
Models that should run via the same Generic path (untested at the chimere level — your mileage may vary): Granite 4.0 H-Tiny / H-Small / H-Micro, Falcon-H1 0.5B – 34B, Bamba-9B v1 / v2, state-spaces/mamba2-*, mistralai/Mamba-Codestral-7B-v0.1, AI21-Jamba-Reasoning-3B, Hymba-1.5B-Base, Zamba2-7B.
The technical doc lives at `chimere-server/docs/STEP7_MULTI_ARCH.md`.
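The dispatch rule itself is a one-key decision. A hypothetical sketch (the table contents follow the README, but the function and constant names are illustrative, not chimere-server's API, and GGUF metadata reading is elided):

```python
# Step 7 dispatch sketch: read the GGUF's general.architecture string and
# pick a code path. Architecture names are the ones listed in the table above.

QWEN_PATH_ARCHS = {"qwen35moe"}
GENERIC_PATH_ARCHS = {"mamba2", "nemotron_h_moe", "mamba"}

def pick_path(general_architecture: str) -> str:
    if general_architecture in QWEN_PATH_ARCHS:
        return "qwen35"   # full stack: MTP, MRoPE, Engram, all backends
    if general_architecture in GENERIC_PATH_ARCHS:
        return "generic"  # libllama-only forward, single-agent at Step 7
    raise ValueError(f"unsupported architecture: {general_architecture}")
```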
All Repositories
Code
| Repository | Lines | What |
|---|---|---|
| chimere | 96K | Rust runtime + DFlash + MTP patches + 5 papers + Step 7 multi-arch dispatch |
| chimere-odo | 17K | Orchestrator + Engram + search + quality loop |
| ik_llama.cpp | (fork) | C++ backend fork — branch mamba2-nemotron-h-backport + PR #1593 |
| ramp-quant | 9K | Quantization pipeline |
Models
| Model | Size | What |
|---|---|---|
| RAMP-v2-15G | 15.2 GB | Production GGUF (automated pipeline) |
| IQ3_S-custom-mix | 14.7 GB | Hand-crafted 317-override GGUF |
| IQ3_S-MTP | 18.4 GB | First MTP-enabled GGUF for Qwen3.5 MoE |
| MeZO LoRA | 340 KB | Zeroth-order LoRA proof-of-concept |
Data
| Dataset | What |
|---|---|
| chimere-dflash-data | DFlash training prompts (3,927) |
| chimere-quality-scores | Quality scores + training pairs |
| chimere-engram-tables | N-gram domain tables |
| chimere-expert-predictor | 4 MLP predictor variants |
| chimere-calibration | imatrix calibration corpus |
Papers (5 drafts, arXiv pending endorsement)
- Block Diffusion Drafting for Hybrid MoE Models — 8 architectures, GDN State Barrier, wall-clock 0.73×
- Chimère System Paper — the complete self-improving stack
- RAMP: Data-Free Mixed-Precision Quantization — 7 builds, QuaRot failure, RAMP-v2
- MTP on Qwen3.5 MoE — 84.8% acceptance, 0.51× (negative result)
- Expert Prefetch — 86.65% hit@8, zero speedup (negative result)
All LaTeX sources: chimere/paper/latex/
Author
Kévin Rémondière — Independent ML researcher, Oloron-Sainte-Marie, France
ORCID: 0009-0008-2443-7166
Built in 7 weeks on a desktop. Everything open-source. The model improves in its sleep.