Chimère: A Self-Improving MoE Inference System for Consumer Hardware
35B parameters. 80 tokens/second on the production HTTP path. One GPU. $0.10/day. The model improves while you sleep.
🆕 Latest update — April 2026: Step 7 multi-architecture dispatch. The same
`chimere-server` runtime now also runs Mamba-2 / Nemotron-H MoE hybrid SSM models end-to-end, on top of a custom backport of upstream llama.cpp's Mamba-2 work into our `ik_llama.cpp` fork (offered upstream as PR #1593). NVIDIA Nemotron-3-Nano-30B-A3B Q4_0 measured at ~45 tok/s on RTX 5060 Ti (sm_120, NCMOE=30, ctx 2048). The Qwen3.5 production path is byte-for-byte unchanged. See Multi-architecture support below.
What is Chimère?
Chimère is a complete inference system — not just a model or a runtime, but an integrated stack where every component feeds the others. It runs Qwen3.5-35B-A3B (35B total parameters, 3.5B active per token, 256 experts) on a single RTX 5060 Ti (16 GB VRAM) at **80 tok/s on the chimere-server HTTP production path** (the bare ik_llama backend reaches ~93 tok/s; the Rust HTTP / sampling layer adds the difference). A nightly quality loop further improves the system from production traffic.
This is the kind of system NVIDIA builds for enterprise deployments, except it runs on a desktop in the south of France.
User request
│
▼
┌─────────────────────────────────────────────────────────┐
│ ODO — Unified Orchestrator (port 8084) │
│ ┌─────────────┐ ┌──────────────┐ ┌───────────────┐ │
│ │ Intent │ │ Entropy │ │ Confidence │ │
│ │ Classifier │ │ Router │ │ RAG Trigger │ │
│ │ (3-cascade) │ │ (fast/qual/ │ │ (logprob │ │
│ │ │ │ ultra) │ │ probe) │ │
│ └──────┬──────┘ └──────┬───────┘ └──────┬────────┘ │
│ │ │ │ │
│ ┌──────▼──────────────────────────────────▼────────┐ │
│ │ Enrichment Pipeline │ │
│ │ • Web Search SOTA (8-stage: expand→search→RRF→ │ │
│ │ fetch→chunk→rerank→CRAG→synthesize) │ │
│ │ • ChromaDB RAG (dense + BM25 + RRF + cross-enc) │ │
│ │ • FAISS Semantic Few-shot (per domain) │ │
│ │ • Dynamic Engram (web → n-gram logit bias) │ │
│ │ • Tool Injection (auto, from pipeline YAML) │ │
│ └──────┬────────────────────────────────────────────┘ │
│ │ │
│ ┌──────▼────────┐ ┌────────────────┐ │
│ │ DVTS Tree │ │ ABF + CGRS │ │
│ │ Search (K=2, │ │ (thinking │ │
│ │ ThinkPRM) │ │ budget mgmt) │ │
│ └──────┬────────┘ └──────┬─────────┘ │
└─────────┼──────────────────┼────────────────────────────┘
│ │
▼ ▼
┌─────────────────────────────────────────────────────────┐
│ chimere-server (port 8081) — Rust Runtime │
│ • ik_llama FFI backend (93 tok/s) │
│ • Multi-tier Engram (Cuckoo <10ns → hash O(1) → FAISS)│
│ • Logprobs (top-5 log-softmax, real values) │
│ • ABF token 248069 forcing at budget threshold │
│ • IQ3_S custom-mix / RAMP-v2 (15.2 GB, 3.78 BPW) │
│ • KV cache: q8_0 keys + q4_0 values (sweet spot) │
└──────────────────────┬──────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────┐
│ Quality & Self-Improvement Loop │
│ • ThinkPRM-1.5B (CPU): step-level verification │
│ • quality_scores.jsonl (104 scores, mean 3.04/5) │
│ • training_pairs.jsonl (68 pairs, score ≥ 4) │
│ • 03:00 — Nightly LoRA (MeZO, stops GGUF→train→restart)│
│ • 04:00 — Engram WRITE (quality-gated, decay >30d) │
│ • Mon 02:00 — DSPy MIPROv2 prompt optimization │
│ • 6h — ChromaDB RAG reindex │
└─────────────────────────────────────────────────────────┘
Why This Is SOTA for Consumer Hardware (March 2026)
vs. Existing Solutions
| System | What it does | What Chimère adds |
|---|---|---|
| llama.cpp / ik_llama | Serve GGUF models | We use ik_llama as backend (+23% vs stock), add Engram, ABF, quality loop |
| KTransformers | CPU/GPU co-serving for MoE | We go further: per-tensor mixed-precision (RAMP), self-improving quality |
| Ollama / LM Studio | Easy local LLM UI | No orchestration, no quality loop, no domain memory, no nightly improvement |
| vLLM / TensorRT-LLM | High-throughput serving | Requires A100+, no consumer GPU support for 35B MoE |
| OpenRouter / Together AI | API access to MoE models | Cloud, not local. $0.60/M tokens vs $0.10/day |
| Autoresearch (Karpathy) | Self-improving research agent | Concept paper, not deployed system. No runtime, no quantization |
| DeepSeek Engram | Conditional memory for MoE | Paper only. We implemented multi-tier version with quality-gated write |
What No One Else Has Combined
- Rust runtime + custom CUDA sm_120 kernels for Blackwell consumer GPUs — 56K lines, the only MoE runtime in Rust
- RAMP data-free quantization — per-tensor mixed-precision without calibration data, 15.2 GB GGUF
- Multi-tier Engram — Cuckoo filter (<10ns) → N-gram hash (O(1)) → FAISS semantic (~5ms), with quality-gated nightly writes
- Adaptive Budget Forcing — thinking budget management for quantized reasoning (IQ3_S produces less coherent thinking than BF16, so budget must be shorter)
- The quality loop — ThinkPRM scores every response, high-quality pairs feed nightly LoRA + Engram + DSPy. The system improves from production traffic.
- Honest negative results — we tried and documented why speculative decoding (DFlash τ=6.06, wall-clock 0.73×), MTP (84.8% acceptance, 0.51×), and expert prefetch (86.65% hit, +1.1%) don't help on this hardware.
The Components
1. ODO — Unified Orchestrator
chimere-odo | 17K lines Python
A single proxy between the user and the model that adds intelligence:
- Intent Classification: 3-strategy cascade (regex 99% <1ms → filetype → LLM GBNF <50ms). Routes to: code, kine, cyber, research, default.
- Entropy Router: Measures query complexity → fast (no-think, 0.7 temp), quality (think, 2048 budget, ABF 0.55), ultra (DVTS K=2, ThinkPRM).
- Enrichment: Web search (8-stage SOTA pipeline), ChromaDB RAG (dense+BM25+RRF+cross-encoder), semantic few-shot, dynamic Engram.
- Quality Gate: ThinkPRM-1.5B verifies step-level reasoning. Score ≤ 2 → retry with reflection.
- Pipeline YAML: Define multi-step agent workflows with hot-reload. 5 pipelines shipped (code: architect→coder, cyber: triage→correlate→remediate).
Why not LangChain/LlamaIndex? Too heavy, too many abstractions, designed for cloud APIs. ODO's core proxy is 1,525 lines doing exactly what we need, with zero external framework dependencies.
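The 3-strategy cascade above can be sketched in a few lines. This is an illustrative Python sketch, not ODO's actual code: the rule sets, function names, and domain keywords are hypothetical; only the cascade order (regex → filetype → constrained LLM fallback) and the domain labels come from the README.

```python
import re

# Hypothetical sketch of ODO's 3-strategy intent cascade: cheap regex rules
# first, then attachment filetype hints, then an LLM fallback. Rule contents
# here are made up; the cascade order and domains follow the README.

REGEX_RULES = [
    (re.compile(r"\b(exploit|CVE-\d{4}-\d+|nmap)\b", re.I), "cyber"),
    (re.compile(r"```|\bdef\b|\bfn\b|\bimport\b"), "code"),
]
FILETYPE_RULES = {".py": "code", ".rs": "code", ".pcap": "cyber"}

def classify(query: str, attached_files=(), llm_fallback=None) -> str:
    # Strategy 1: regex (fast path; per the README it handles ~99% in <1ms)
    for pattern, domain in REGEX_RULES:
        if pattern.search(query):
            return domain
    # Strategy 2: file extension of any attachment
    for name in attached_files:
        for ext, domain in FILETYPE_RULES.items():
            if name.endswith(ext):
                return domain
    # Strategy 3: constrained LLM call (e.g. a GBNF grammar that limits the
    # output to the known domain labels); stubbed out here
    if llm_fallback is not None:
        return llm_fallback(query)
    return "default"
```

The point of the ordering is cost: each strategy is roughly an order of magnitude slower than the previous one, so the LLM is only consulted for the residue the cheap rules cannot decide.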
2. Engram — Multi-Tier Domain Memory
Inspired by DeepSeek Engram but implemented as a lightweight lookup system:
- Tier 0 — Cuckoo filter (<10ns): skips 97% of lookups for tokens not in any table
- Tier 1 — FNV-1a hash tables (O(1)): core N-gram matching, binary format compatible with Rust and Python
- Tier 2 — FAISS semantic (~5ms): embedding-based few-shot example retrieval
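The tier ordering can be sketched as a cheap-to-expensive cascade. This is an illustrative sketch with hypothetical names: a Python set stands in for the Cuckoo filter, a dict for the FNV-1a hash tables, and a stubbed callable for the FAISS tier.

```python
# Three-tier lookup sketch: Tier 0 rejects misses before any table access,
# Tier 1 resolves exact n-gram hits, Tier 2 is only consulted for semantic
# few-shot retrieval. Data structures are simplified stand-ins.

class EngramLookup:
    def __init__(self, ngram_table, semantic_search=None):
        self.ngram_table = ngram_table          # Tier 1: n-gram -> {token_id: weight}
        self.prefilter = set(ngram_table)       # Tier 0: membership test (Cuckoo stand-in)
        self.semantic_search = semantic_search  # Tier 2: expensive, optional

    def logit_bias(self, context_ngram):
        # Tier 0: O(1) membership check; per the README this skips ~97% of lookups
        if context_ngram not in self.prefilter:
            return {}
        # Tier 1: exact n-gram hit in the hash table
        return self.ngram_table.get(context_ngram, {})

    def few_shot(self, query_embedding, k=3):
        # Tier 2: semantic retrieval, only when enrichment asks for it
        if self.semantic_search is None:
            return []
        return self.semantic_search(query_embedding, k)

table = {("genou", "valgus"): {4821: 0.1}}   # toy domain table
engram = EngramLookup(table)
```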
Ablation results (measured on 10-question benchmark):
Engram v1 (α=0.35, think+response): 77% ← biases thinking, DEGRADES
Engram OFF (α=0): 85% ← baseline
Engram v2 (α=0.1, response-only): 88% ← PRODUCTION
Key insight: applying Engram bias during the thinking phase constrains reasoning with domain patterns. Response-only bias with low α is the sweet spot.
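The v2 policy is small enough to state as code. A minimal sketch, assuming a dict of logits and an explicit phase flag (both illustrative; the real runtime applies this inside the Rust sampler):

```python
# v2 Engram bias: applied only during the response phase, scaled by a small
# alpha. v1 (alpha=0.35, applied during thinking too) degraded accuracy, so
# the thinking phase is left untouched. Names and shapes are illustrative.

ALPHA = 0.1  # production value per the ablation above

def apply_engram_bias(logits, bias_table, in_thinking_phase):
    if in_thinking_phase:
        return logits                      # never bias the reasoning tokens
    out = dict(logits)
    for token_id, weight in bias_table.items():
        out[token_id] = out.get(token_id, 0.0) + ALPHA * weight
    return out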
Why not RAG alone? RAG injects knowledge via context (expensive, limited by context window). Engram injects at the logit level (zero context cost, unlimited knowledge). They're complementary — RAG for long-form retrieval, Engram for factual bias.
3. RAMP — Data-Free Quantization Pipeline
ramp-quant | 9K lines Python + C
Produces hardware-optimized GGUF without calibration data:
- NSDS sensitivity — kurtosis + SVD rank per tensor, data-free
- Proxy model — round-trip quantization error × sensitivity = instant loss estimate
- Evolutionary search — 128 population, 200 generations, under VRAM budget
- Build — generates the `llama-quantize --custom-q` command
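The evolutionary search stage can be sketched as follows. This is a toy illustration, not RAMP's implementation: the quant table, error numbers, and population/generation sizes are made up, and the fitness is a stand-in for the proxy-model loss estimate.

```python
import random

# Toy evolutionary search over per-tensor quant assignments: fitness is
# proxy loss (quantization error x sensitivity) with a hard penalty for
# exceeding the size budget. All numbers here are illustrative.

QUANTS = {"Q8_0": (8.5, 0.01), "Q5_K": (5.5, 0.05), "IQ3_S": (3.4, 0.20)}  # (bpw, error)

def proxy_loss(genome, sensitivity):
    return sum(QUANTS[q][1] * s for q, s in zip(genome, sensitivity))

def size_bpw(genome):
    return sum(QUANTS[q][0] for q in genome) / len(genome)

def evolve(sensitivity, budget_bpw, pop=32, gens=100, seed=0):
    rng = random.Random(seed)
    names = list(QUANTS)
    def random_genome():
        return [rng.choice(names) for _ in sensitivity]
    def fitness(g):  # penalize genomes over the size budget
        penalty = 1e6 if size_bpw(g) > budget_bpw else 0.0
        return proxy_loss(g, sensitivity) + penalty
    # seed with the smallest mix so a feasible genome always exists
    population = [["IQ3_S"] * len(sensitivity)] + [random_genome() for _ in range(pop - 1)]
    for _ in range(gens):
        population.sort(key=fitness)
        survivors = population[: pop // 2]          # elitism: keep the best half
        children = []
        for g in survivors:
            child = list(g)
            child[rng.randrange(len(child))] = rng.choice(names)  # point mutation
            children.append(child)
        population = survivors + children
    return min(population, key=fitness)

best = evolve(sensitivity=[1.0, 0.8, 0.3, 0.05], budget_bpw=5.0)
```

The search naturally pushes high-precision formats onto high-sensitivity tensors and spends the cheap IQ3_S budget on the insensitive ones, which is exactly the hierarchy the next section reports.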
GDN sensitivity hierarchy discovered (most to least sensitive):
| Tensor group | Quantization |
|---|---|
| SSM gates (α, β) | Q8_0 |
| Attention Q/K | Q5_K / Q6_K |
| Shared experts | Q5_K |
| Routed experts | IQ3_S |
What we tried that failed:
| Method | Result | Why |
|---|---|---|
| QuaRot (Hadamard rotation) | PPL = 49,524 | Incompatible with GDN recurrent state |
| OptRot (Givens rotation) | PPL = 49,524 | No cross-layer absorption |
| ParoQuant (pairwise rotation) | OOM | 256 experts × grouped_mm > 16 GB |
| EvoPress (KL fitness) | OOM | Full model needed in RAM (70 GB) |
| GPTQ/AWQ layer-wise | Complex | MoE expert handling not supported |
4. chimere-server — Rust Inference Runtime
chimere | 56K lines Rust + 2.6K CUDA
The only MoE inference runtime written in Rust, with:
- Custom CUDA kernels for sm_120 (IQ3_S dequant, Q8_0+dp4a GEMV, flash attention, fused MoE)
- Three backends: libllama FFI (93 tok/s), cudarc (57 tok/s), Candle (18 tok/s)
- GDN state save/restore (impossible in llama.cpp — this enables speculative decoding on hybrid architectures)
- OpenAI-compatible `/v1/chat/completions` API with streaming logprobs
Performance journey (5 days, March 14-19):
9.1 → 21 → 30.5 → 42.5 → 57 → 93 tok/s
Through 9 Candle optimizations, Q8_1+dp4a kernels, fused operations, and finally the libllama FFI breakthrough.
5. Self-Improvement Loop
The system improves while idle:
| Timer | What | How |
|---|---|---|
| 03:00 daily | LoRA training | MeZO zeroth-order on quality-filtered pairs (score ≥ 4). Stops GGUF server → trains → restarts (try/finally). |
| 04:00 daily | Engram WRITE | Add validated responses to domain tables. Decay: halve weight >30d, delete >90d. |
| Mon 02:00 | DSPy MIPROv2 | Bayesian prompt optimization per domain. Tested: code +8% on benchmark. |
| Every 6h | RAG reindex | ChromaDB re-ingestion of knowledge base. |
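The 04:00 Engram decay rule from the table can be sketched directly. A minimal sketch, assuming a `{ngram: (weight, written_at)}` entry format (hypothetical; the real tables are the binary format mentioned above) and reading "halve >30d" as a single halving:

```python
from datetime import datetime, timedelta

# Engram WRITE decay policy: entries older than 30 days get their weight
# halved, entries older than 90 days are dropped entirely.

def decay_entries(entries, now=None):
    now = now or datetime.now()
    kept = {}
    for ngram, (weight, written_at) in entries.items():
        age = now - written_at
        if age > timedelta(days=90):
            continue                 # delete: too stale to trust
        if age > timedelta(days=30):
            weight = weight / 2      # halve: old but still useful
        kept[ngram] = (weight, written_at)
    return kept
```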
Quality scoring: ThinkPRM-1.5B runs on CPU, scoring every response 1-5 with step-level chain-of-thought verification. 104 scores logged (mean 3.04/5), 68 training pairs generated, 72 SPIN DPO pairs accumulated.
Nightly scoring also uses a 9B model on CPU (qwen9b-scorer, port 8085), running alongside the production 35B with zero VRAM impact. The previous 27B scorer required stopping production; the 9B CPU scorer eliminated all nightly downtime.
Why not RLHF/DPO on cloud? We can't afford cloud GPU time. MeZO trains at inference cost — the script stops the GGUF server, trains for ~1 min, restarts. Quality is lower than full DPO but it's free and runs every night.
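MeZO's trick is that it needs only two forward passes per step, no backprop state, which is why it can train at inference cost. Here is a self-contained sketch of a MeZO-style (SPSA) step on a toy quadratic loss; dimensions, learning rate, and the loss are made up for illustration, and the real run perturbs LoRA weights inside the stopped-server window.

```python
import random

# One MeZO-style zeroth-order step: perturb parameters by +eps*z and -eps*z,
# take two forward passes, and use the loss difference as a projected
# gradient estimate. No backward pass is ever computed.

def mezo_step(theta, loss_fn, lr=0.01, eps=1e-3, seed=0):
    rng = random.Random(seed)            # seed lets z be regenerated, not stored
    z = [rng.gauss(0, 1) for _ in theta]
    plus = [t + eps * zi for t, zi in zip(theta, z)]
    minus = [t - eps * zi for t, zi in zip(theta, z)]
    g = (loss_fn(plus) - loss_fn(minus)) / (2 * eps)   # scalar estimate
    return [t - lr * g * zi for t, zi in zip(theta, z)]

loss = lambda th: sum(t * t for t in th)  # toy loss with minimum at 0
theta = [1.0, -2.0, 0.5]
for step in range(500):
    theta = mezo_step(theta, loss, seed=step)  # fresh perturbation each step
```

The seed-based regeneration of `z` is what keeps memory at inference level: the perturbation never needs to be materialized alongside optimizer state.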
6. DFlash — Block Diffusion Drafter
8 architectures over 27 days. Best holdout result: τ=6.06 (comparable to original DFlash paper). But wall-clock = 0.73× (slowdown) because the target model is too fast (93 tok/s) for speculative decoding to help.
The GDN State Barrier: GDN recurrent layers cannot be rolled back after draft rejection. This is a structural incompatibility affecting all hybrid SSM-attention models (Jamba, RWKV, Qwen3.5). No amount of drafter improvement fixes this — it requires a new runtime (chimere-server provides one).
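A back-of-envelope cost model shows why a good τ can still lose wall-clock against a fast target. The formula and the draft-cost numbers below are illustrative assumptions, not measurements from the DFlash runs; only τ=6.06 and the ~10.8 ms/step implied by 93 tok/s come from the text.

```python
# Each speculation round pays draft_len draft steps plus one target
# verification, and yields tau accepted tokens on average. The faster the
# target, the less each saved target step is worth.

def spec_decode_speedup(tau, t_target_ms, t_draft_ms, draft_len):
    baseline = tau * t_target_ms                       # tau plain target steps
    speculative = draft_len * t_draft_ms + t_target_ms # one round's cost
    return baseline / speculative

# Illustrative drafter cost of 9 ms/step (block-diffusion drafting is not
# cheap per step). Same drafter, two different targets:
fast_target = spec_decode_speedup(tau=6.06, t_target_ms=10.8, t_draft_ms=9.0, draft_len=8)
slow_target = spec_decode_speedup(tau=6.06, t_target_ms=50.0, t_draft_ms=9.0, draft_len=8)
```

With these assumed numbers the 10.8 ms target loses wall-clock while a 50 ms target would gain more than 2x, which matches the qualitative conclusion: the 93 tok/s target is simply too fast for this drafter to pay for itself.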
7. MTP — Multi-Token Prediction (Negative Result)
First MTP implementation for Qwen3.5 MoE in ik_llama.cpp (5 patches, 8 bugs fixed). 84.8% acceptance but 0.51× speedup — the MTP layer is itself MoE (256 experts on CPU), costing as much as a main forward.
8. Expert Prefetch (Negative Result)
An MLP predictor achieves 86.65% hit@8 accuracy, yet end-to-end throughput improves by only +1.1% (effectively zero): ggml's multi-threaded CPU loop makes GPU prefetch serialize what was parallel work.
Quantified Results
| Metric | Value |
|---|---|
| Generation throughput (Qwen3.5-35B-A3B prod path) | 80 tok/s chimere-server HTTP, ~93 tok/s bare ik_llama backend |
| Generation throughput (Nemotron-3-Nano-30B-A3B, NEW) | ~45 tok/s chimere-server HTTP, NCMOE=30, ctx 2048 |
| Model | Qwen3.5-35B-A3B, RAMP-v2 15.2 GB (3.78 BPW) |
| VRAM usage | ~14 GB / 16 GB |
| Benchmark | 10/10 (code, math, tools, domain) |
| Engram ablation | v1: 77% → OFF: 85% → v2: 88% |
| Quality scores | 104 entries, mean 3.04/5 |
| DFlash τ (holdout) | 6.06 (comparable to paper's 6.4) |
| DFlash wall-clock | 0.73× (negative — honest result) |
| MTP acceptance | 84.8% (but 0.51× throughput) |
| Expert prefetch | 86.65% hit@8 (but +1.1% throughput) |
| Code size | 121K lines (Rust + Python + CUDA + C) |
| Cost | $0.10/day electricity |
| Hardware | RTX 5060 Ti 16GB, i5-14600KF, 32GB DDR5 |
Multi-architecture support
As of April 2026 (Step 7 of the chimere-server multi-arch refactor), the same `chimere-server` runtime dispatches between two code paths based on the GGUF's `general.architecture` metadata:
| Path | Architectures | Features |
|---|---|---|
| Qwen3.5 (prod) | `qwen35moe` | Full stack: MTP, MRoPE, Engram, multi-agent, cudarc / Candle / libllama backends, fast C++ sampler |
| Generic (libllama) | `mamba2`, `nemotron_h_moe`, `mamba` | libllama-only: forward via `LlamaForward` FFI, no MTP, no Engram, single-agent at Step 7 |
The Generic path was unblocked by a 12-commit Phase 3.x backport of upstream llama.cpp's Mamba-2 / Nemotron-H MoE support into our ik_llama.cpp fork, offered upstream as PR #1593. Validated end-to-end on:
- `unsloth/Nemotron-3-Nano-30B-A3B-GGUF` Q4_0: 45 tok/s on RTX 5060 Ti, NCMOE=30, ctx 2048, via `bin/test-nemotron` and through HTTP `/v1/chat/completions`
- `unsloth/Nemotron-3-Nano-30B-A3B-GGUF` UD-IQ3_XXS: same path, coherent text on CPU
Models that should run via the same Generic path (untested at the chimere level — your mileage may vary): Granite 4.0 H-Tiny / H-Small / H-Micro, Falcon-H1 0.5B – 34B, Bamba-9B v1 / v2, state-spaces/mamba2-*, mistralai/Mamba-Codestral-7B-v0.1, AI21-Jamba-Reasoning-3B, Hymba-1.5B-Base, Zamba2-7B.
The technical doc lives at `chimere-server/docs/STEP7_MULTI_ARCH.md`.
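The dispatch rule itself is a one-key decision. A hypothetical sketch (the table contents follow the README, but the function and constant names are illustrative, not chimere-server's API, and GGUF metadata reading is elided):

```python
# Step 7 dispatch sketch: read the GGUF's general.architecture string and
# pick a code path. Architecture names are the ones listed in the table above.

QWEN_PATH_ARCHS = {"qwen35moe"}
GENERIC_PATH_ARCHS = {"mamba2", "nemotron_h_moe", "mamba"}

def pick_path(general_architecture: str) -> str:
    if general_architecture in QWEN_PATH_ARCHS:
        return "qwen35"   # full stack: MTP, MRoPE, Engram, all backends
    if general_architecture in GENERIC_PATH_ARCHS:
        return "generic"  # libllama-only forward, single-agent at Step 7
    raise ValueError(f"unsupported architecture: {general_architecture}")
```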
All Repositories
Code
| Repository | Lines | What |
|---|---|---|
| chimere | 96K | Rust runtime + DFlash + MTP patches + 5 papers + Step 7 multi-arch dispatch |
| chimere-odo | 17K | Orchestrator + Engram + search + quality loop |
| ik_llama.cpp | (fork) | C++ backend fork — branch mamba2-nemotron-h-backport + PR #1593 |
| ramp-quant | 9K | Quantization pipeline |
Models
| Model | Size | What |
|---|---|---|
| RAMP-v2-15G | 15.2 GB | Production GGUF (automated pipeline) |
| IQ3_S-custom-mix | 14.7 GB | Hand-crafted 317-override GGUF |
| IQ3_S-MTP | 18.4 GB | First MTP-enabled GGUF for Qwen3.5 MoE |
| MeZO LoRA | 340 KB | Zeroth-order LoRA proof-of-concept |
Data
| Dataset | What |
|---|---|
| chimere-dflash-data | DFlash training prompts (3,927) |
| chimere-quality-scores | Quality scores + training pairs |
| chimere-engram-tables | N-gram domain tables |
| chimere-expert-predictor | 4 MLP predictor variants |
| chimere-calibration | imatrix calibration corpus |
Papers (5 drafts, arXiv pending endorsement)
- Block Diffusion Drafting for Hybrid MoE Models — 8 architectures, GDN State Barrier, wall-clock 0.73×
- Chimère System Paper — the complete self-improving stack
- RAMP: Data-Free Mixed-Precision Quantization — 7 builds, QuaRot failure, RAMP-v2
- MTP on Qwen3.5 MoE — 84.8% acceptance, 0.51× (negative result)
- Expert Prefetch — 86.65% hit@8, zero speedup (negative result)
All LaTeX sources: chimere/paper/latex/
Author
Kévin Rémondière — Independent ML researcher, Oloron-Sainte-Marie, France
ORCID: 0009-0008-2443-7166
Built in 7 weeks on a desktop. Everything open-source. The model improves in its sleep.