# Qwen3.5-35B-A3B Chimere v3 -- RAMP GGUF
Chimere v3: Claude Opus 4.6 distillation of Qwen3.5-35B-A3B, optimized for instruction following and reasoning.
RAMP quantization (per-tensor quality overrides + imatrix) -- 15 GB, fits 16 GB VRAM, ~80 tok/s on RTX 5060 Ti.
Looking for v1 (best code + tools)? See Chimere v1 GGUF.
## Compatible runtimes
This GGUF can be loaded by any runtime that supports the Qwen3.5-35B-A3B (qwen35moe) architecture. The reference runtime — and the one that exercises all chimere-specific features (Engram n-gram bias, multi-agent context switching, the C++ fast sampler with DRY + min-p, K-cache Hadamard rotation, fused MoE up/gate) — is chimere-server.
| Runtime | Engram | Multi-agent | DRY sampler | K-cache Hadamard | Notes |
|---|---|---|---|---|---|
| chimere-server (Rust, official) | yes | yes | yes (C++ fast path) | yes | Production target. Also runs Mamba-2 / Nemotron-H MoE through the same backend (PR ikawrakow/ik_llama.cpp#1593). |
| ik_llama.cpp llama-server | no | no | optional | optional | Same backend that chimere-server links against, just without the Rust HTTP/sampling layer. |
| llama.cpp stock llama-server | no | no | no | no | Works, but slower on Qwen3.5 MoE on our hardware (no iqk matmul, no fused MoE up/gate). |
## Benchmark Results

### v3 strengths: instructions and reasoning
| Benchmark | v3 RAMP (this repo) | v1 RAMP | Base Qwen3.5-35B-A3B | Notes |
|---|---|---|---|---|
| IFEval (15 instruction tests) | 100% | 67% | ~91.9% | +33 pts vs v1 |
| Edge cases (15 adversarial tests) | 100% | 87% | -- | Perfect prompt injection resistance |
| GSM8K CoT 8-shot (1,319 questions) | 84.0% | 52.2% | -- | +32 pts vs v1 |
| HumanEval (30 problems, executed) | 83% | 97% | -- | v1 better here |
| BFCL tool-calling (20 questions) | 75% | 90% | 67.3% | v1 better here |
| Speed (RTX 5060 Ti 16 GB, chimere-server) | ~80 tok/s | ~80 tok/s | -- | NCMOE=3, ctx 64K |
### Qualitative agentic tests
| Scenario | v3 | v1 | /10 |
|---|---|---|---|
| Cybersecurity incident response (multi-tool chain) | 4 | 4 | 10 |
| ML pipeline architecture (RAG, 10K users, $50K budget) | 8 | 8 | 10 |
| Rust MoE runtime optimization (async prefetch, CUDA) | 8 | 7 | 10 |
| Total | 20 | 19 | 30 |
### Honest assessment
- Strengths: 100% IFEval, 100% adversarial edge cases, 84% GSM8K, best overall reasoning
- Weaknesses: Code generation slightly weaker (83% vs 97%), tool-calling lower (75% vs 90%)
- Why: v3 dataset added IFEval-strict, OPSDC-compressed reasoning, and instruction-following samples on top of v1 base. Recommended for general agentic use.
## Which version to use?
| Use case | Recommended | Why |
|---|---|---|
| Instruction following, formatting | v3 (this repo) | 100% IFEval, 100% edge cases |
| Math reasoning | v3 (this repo) | 84% GSM8K (vs 52% v1) |
| Prompt injection resistance | v3 (this repo) | 100% adversarial edge cases |
| Code generation, debugging | v1 | 97% HumanEval |
| Tool-calling, function calling | v1 | 90% BFCL |
| Re-quantization or fine-tuning | BF16 weights | Full precision |
Best of both worlds: Use A-LoRA routing -- an intent classifier selects the appropriate LoRA at runtime. Code/tools queries use v1, instruction/reasoning queries use v3. See Chimere ODO.
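The routing idea can be sketched in a few lines. The real Chimere ODO intent classifier is not reproduced here; this keyword heuristic is a hypothetical stand-in that only illustrates the decision (v1 for code/tools queries, v3 for everything else):

```python
# Hypothetical sketch of A-LoRA-style intent routing. The real Chimere ODO
# classifier is not public in this card; this keyword heuristic is a
# stand-in illustrating the v1-vs-v3 routing decision.
CODE_TOOL_HINTS = (
    "def ", "class ", "debug", "compile", "stack trace",
    "regex", "sql", "tool_call", "json schema",
)

def route_adapter(user_query: str) -> str:
    """Return which checkpoint/LoRA to serve the query with."""
    q = user_query.lower()
    if any(hint in q for hint in CODE_TOOL_HINTS):
        return "chimere-v1"  # best code + tools (97% HumanEval, 90% BFCL)
    return "chimere-v3"      # best instructions + reasoning (100% IFEval, 84% GSM8K)

print(route_adapter("Debug this Python function"))           # chimere-v1
print(route_adapter("Summarize the report in 5 bullets"))    # chimere-v3
```

A production router would use a trained classifier rather than substring matching, but the interface is the same: query in, adapter name out.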
## Quick start (chimere-server, recommended)
```bash
# 1. Backend (one-time): build the ik_llama.cpp fork with sm_120 CUDA + Mamba-2 backport
git clone https://github.com/AIdevsmartdata/ik_llama.cpp.git ~/ik_llama.cpp
cd ~/ik_llama.cpp
git checkout mamba2-nemotron-h-backport
cmake -B build_sm120 -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=120 -DGGML_NATIVE=OFF
cmake --build build_sm120 -j

# 2. Server
git clone https://github.com/AIdevsmartdata/chimere.git
cd chimere/chimere-server
LD_LIBRARY_PATH=$HOME/ik_llama.cpp/build_sm120/ggml/src:$HOME/ik_llama.cpp/build_sm120/src:/usr/local/cuda-12.8/lib64 \
cargo build --release --features server --bin chimere-server

# 3. Model + tokenizer
mkdir -p ~/models && cd ~/models
hf download Kevletesteur/Qwen3.5-35B-A3B-Chimere-v3-GGUF chimere-v3-ramp.gguf
hf download Qwen/Qwen3.5-35B-A3B tokenizer.json --local-dir tokenizers/qwen35

# 4. Run (production env vars)
CHIMERE_MODEL=$PWD/chimere-v3-ramp.gguf \
CHIMERE_TOKENIZER=$PWD/tokenizers/qwen35/tokenizer.json \
CHIMERE_LLAMA_BACKEND=1 \
CHIMERE_NCMOE=3 \
CHIMERE_KV_MAX_SEQ=65536 \
CHIMERE_PORT=8081 \
CHIMERE_FORCE_QWEN35=1 \
LD_LIBRARY_PATH=$HOME/ik_llama.cpp/build_sm120/ggml/src:$HOME/ik_llama.cpp/build_sm120/src:/usr/local/cuda-12.8/lib64 \
~/chimere/chimere-server/target/release/chimere-server

# 5. Hello world
curl -s http://localhost:8081/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"messages":[{"role":"user","content":"Hello"}],"max_tokens":64}'
```
## Engram (optional, prod-only)
Chimere ships an n-gram logit bias overlay loaded from binary .engr tables. To enable it, set:
```bash
CHIMERE_ENGRAM_DIR=/path/to/engram_tables   # directory of *.engr files
CHIMERE_ENGRAM_ALPHA=0.1                    # logit bias strength
```
The engram tables are tokenizer-specific (Qwen3.5 vocab) and applied as a per-domain overlay (kine, code, cyber, general). They are intended as a domain-knowledge injector, not a measured quality booster -- see the chimere repo README for the honest status of this feature.
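The mechanism behind an n-gram logit-bias overlay can be sketched in pure Python. The on-disk `.engr` format and the real lookup are chimere-internal and not documented in this card; the table layout, the `log1p(count)` scaling, and the token IDs below are illustrative assumptions -- only the alpha knob (`CHIMERE_ENGRAM_ALPHA`) comes from the text above:

```python
import math

# Toy bigram "engram" table: prev_token_id -> {next_token_id: count}.
# The real .engr binary format is chimere-internal; this only illustrates
# the idea of biasing logits toward domain-frequent continuations.
ENGRAM = {
    101: {7: 12, 9: 3},  # after token 101, tokens 7 and 9 are domain-frequent
}

def apply_engram_bias(logits: dict[int, float], prev_token: int,
                      alpha: float = 0.1) -> dict[int, float]:
    """Add alpha * log1p(count) to logits of continuations in the table.
    The log1p scaling is an assumption, not the documented formula."""
    biased = dict(logits)
    for tok, count in ENGRAM.get(prev_token, {}).items():
        if tok in biased:
            biased[tok] += alpha * math.log1p(count)
    return biased

logits = {7: 0.0, 8: 0.5, 9: 0.0}
out = apply_engram_bias(logits, prev_token=101, alpha=0.1)
# token 7 gains 0.1 * ln(13); token 8 is untouched
```

Because the bias is additive in logit space and scaled by alpha, setting `CHIMERE_ENGRAM_ALPHA=0` recovers the unbiased model exactly.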
## Quick start (generic GGUF runtimes)
If you do not need the chimere stack, the GGUF works with any Qwen3.5-compatible runtime:
```bash
# llama.cpp / llama-server
llama-server \
  -m chimere-v3-ramp.gguf \
  -ngl 99 --n-cpu-moe 4 -c 32768 \
  --flash-attn on --jinja --port 8081

# For 16 GB VRAM (RTX 5060 Ti / RTX 4080), add KV cache
# quantization to save VRAM:
#   -ctk q8_0 -ctv q4_0
```
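Rough arithmetic on why `-ctk q8_0 -ctv q4_0` helps, using ggml's block layouts (q8_0 stores 32 values in 34 bytes, q4_0 in 18 bytes, vs 2 bytes per value for f16). The absolute KV footprint also depends on the model's layer and head dimensions, which this card does not restate, so only the ratio is computed:

```python
# Back-of-envelope VRAM ratio for a quantized KV cache
# (-ctk q8_0 -ctv q4_0) relative to the default f16 K/V.
# Bytes per element from ggml block layouts:
F16  = 2.0        # 2 bytes per value
Q8_0 = 34 / 32    # 34-byte block per 32 values -> 1.0625 B/value
Q4_0 = 18 / 32    # 18-byte block per 32 values -> 0.5625 B/value

f16_kv   = F16 + F16      # K + V per cached element
mixed_kv = Q8_0 + Q4_0    # q8_0 K + q4_0 V

print(f"mixed KV is {mixed_kv / f16_kv:.1%} of f16 KV")
```

So the mixed setting keeps roughly 40% of the f16 KV-cache footprint, freeing VRAM for a larger context or more GPU-resident experts.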
## Recommended sampling parameters
| Mode | temp | top_p | top_k | presence_penalty |
|---|---|---|---|---|
| Thinking (default) | 1.0 | 0.95 | 20 | 0.0 |
| Thinking + code/tools | 0.6 | 0.95 | 20 | 0.0 |
| No-think | 0.7 | 0.8 | 20 | 0.0 |
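The knobs in the table compose as in most llama.cpp-style sampling chains. A minimal pure-Python sketch (this is not the chimere C++ fast sampler; the temperature -> top-k -> top-p ordering mirrors common llama.cpp behavior and is an assumption here):

```python
import math, random

def sample(logits: list[float], temp: float = 0.6, top_k: int = 20,
           top_p: float = 0.95, rng=random.random) -> int:
    """Minimal temperature -> top-k -> top-p (nucleus) sampler sketch."""
    # Temperature: flatten or sharpen the distribution.
    scaled = [l / temp for l in logits]
    # Top-k: keep only the k highest-logit tokens.
    order = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)[:top_k]
    # Softmax over the survivors (numerically stable).
    m = max(scaled[i] for i in order)
    exps = [(i, math.exp(scaled[i] - m)) for i in order]
    z = sum(e for _, e in exps)
    probs = [(i, e / z) for i, e in exps]
    # Top-p: smallest prefix (by descending prob) whose mass reaches top_p.
    kept, acc = [], 0.0
    for i, p in probs:
        kept.append((i, p)); acc += p
        if acc >= top_p:
            break
    # Renormalize and draw.
    z2 = sum(p for _, p in kept)
    r = rng() * z2
    for i, p in kept:
        r -= p
        if r <= 0:
            return i
    return kept[-1][0]
```

With the "Thinking + code/tools" row (temp 0.6, top_k 20, top_p 0.95) a confident model collapses toward greedy decoding, while the default thinking row (temp 1.0) keeps more of the tail.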
## Backend
The official chimere-server runtime links against a customized ik_llama.cpp fork (branch mamba2-nemotron-h-backport, head of upstream PR ikawrakow/ik_llama.cpp#1593).
Highlights of the chimere-specific layer on top of ik_llama:
- Custom C++ fast sampler exporting `sample_token_fast`, `set_logit_bias`, `set_engram_bias`, `clear_engram_bias` and `take_packed_logprobs` -- avoids a ~993 KB logits copy per token and packs OpenAI-format top-5 logprobs.
- K-cache Hadamard rotation, fused MoE up/gate, grouped expert routing -- all enabled by default via `cparams`.
- Multi-agent KV / SSM state save & restore via `llama_state_seq_*`, keyed on the OpenAI `user` field. Up to `CHIMERE_MAX_AGENTS` (default 4) concurrent personas, each with its own conversation state.
- An OpenAI-compatible HTTP layer in Rust (axum 0.8) supporting non-streaming and SSE streaming, tool calls, `<think>` reasoning extraction, and `chat_template_kwargs.enable_thinking`.
## Multi-architecture support

The chimere-server runtime is no longer Qwen-only. As of Step 7 (April 2026), it dispatches between two code paths based on the GGUF's `general.architecture` metadata:

- Qwen3.5-35B-A3B (`qwen35moe`) -- full production stack: MTP, MRoPE, Engram, agent scheduler, custom Candle / cudarc / libllama paths. This GGUF.
- Mamba-2 / Nemotron-H MoE / Mamba-1 / Mamba-2 hybrids -- libllama-only path via `GenericModel`. No MTP, no Engram, single-agent only at Step 7. Validated end-to-end on `unsloth/Nemotron-3-Nano-30B-A3B-GGUF` (Q4_0 and UD-IQ3_XXS) at ~45 tok/s on an RTX 5060 Ti (NCMOE=30, ctx 2048) via the bundled `test-nemotron` smoke binary.
Models that should run via the same Generic path (untested at the chimere level — your mileage may vary): Granite 4.0 H-Tiny / H-Small / H-Micro, Falcon-H1 0.5B – 34B, Bamba-9B v1 / v2, state-spaces/mamba2-*, mistralai/Mamba-Codestral-7B-v0.1, AI21-Jamba-Reasoning-3B.
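The dispatch reduces to a lookup on the architecture string. Only `qwen35moe` is stated in this card; the generic-path architecture names below are assumptions for illustration -- check the `general.architecture` key in your GGUF's metadata:

```python
# Sketch of the Step 7 dispatch on the GGUF `general.architecture` key.
# `qwen35moe` is the arch string given in this card; the names in
# GENERIC_ARCHS are illustrative assumptions, not a verified list.
GENERIC_ARCHS = {"mamba", "mamba2", "nemotron_h", "jamba", "falcon_h1", "granitehybrid"}

def select_path(architecture: str) -> str:
    """Route a loaded GGUF to the full stack or the GenericModel path."""
    if architecture == "qwen35moe":
        return "full"     # MTP, MRoPE, Engram, agent scheduler
    if architecture in GENERIC_ARCHS:
        return "generic"  # libllama-only GenericModel path
    raise ValueError(f"unsupported architecture: {architecture}")
```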
## RAMP Quantization Details
Custom per-tensor quality overrides -- critical paths get higher precision. Overall: ~3.78 BPW.
| Tensor | Quant | BPW | Rationale |
|---|---|---|---|
| attn_v (value) | Q8_0 | 8.0 | Most critical -- errors cause hallucinations |
| ssm_alpha, ssm_d | Q8_0 | 8.0 | GDN recurrent params, tiny but hypersensitive |
| attn_k (key) | Q6_K | 6.5 | Important for attention routing |
| ssm_dt | Q6_K | 6.5 | GDN timestep |
| token_embd, output | Q6_K | 6.5 | Shared embeddings |
| attn_q, attn_output | Q5_K | 5.5 | More tolerant |
| ssm_in, ssm_out | Q5_K | 5.5 | SSM projections |
| 256 MoE experts (FFN) | IQ3_S | 3.44 | 80% of params, high MoE redundancy |
- imatrix: Generated on BF16 model (B200, 192 GB VRAM), 200 calibration chunks
- Result: 15 GB with no measured quality loss on our agentic benchmarks vs BF16
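The stated ~3.78 BPW and the 15 GB file size are consistent, assuming ~35B total parameters from the model name (the exact parameter count may differ slightly):

```python
# Sanity check: ~3.78 bits/weight over ~35e9 total params (inferred from
# the name Qwen3.5-35B-A3B; exact count is an assumption) should land
# near the stated 15 GB file size.
params = 35e9
bpw = 3.78
size_gib = params * bpw / 8 / 2**30
print(f"{size_gib:.1f} GiB")
```

This lands at roughly 15.4 GiB, matching the repo's 15 GB figure (metadata and the imatrix-driven per-tensor overrides account for the small difference).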
## Training Details
| Parameter | Value |
|---|---|
| Base model | Qwen/Qwen3.5-35B-A3B (MoE, 256 experts) |
| Method | SFT BF16 LoRA r64, completion-only loss |
| Dataset | 10,191 samples (v1 base + 428 additional: IFEval strict, OPSDC reasoning, instruction following) |
| Epochs | 1 (160 steps, batch 64) |
| Training GPU | NVIDIA B200 |
| Training cost | ~$2 |
### v3 dataset additions (on top of v1 base)
- +50 IFEval strict (5 constraint categories)
- +30 strict code (no markdown)
- +30 code gen with thinking
- +30 instruction following
- +20 OPSDC-compressed reasoning (-64% tokens)
- +15 multi-turn agentic
## Limitations
- **MTP infrastructure present, gated.** This GGUF carries an MTP (multi-token prediction) head -- chimere-server detects it via `n_nextn_layer = 1` and exposes the speculative-decoding infrastructure (`mtp_scheduler.rs`, `MtpOpFFI`). An early March bench on a previous build measured a +49.5% token acceptance rate for the MTP draft path; that figure is not currently reproducible because `bench_mtp.rs:104-167` has Benchmarks 2 and 5 hard-coded as `SKIPPED` with the comment `crash in ik_llama MTP graph, KV cache issue for layer 41`. Until that fix lands, the 80 tok/s figure above is the non-MTP path. We will re-publish the MTP gain once the bench passes.
- **Engram is a domain-knowledge overlay, not a measured quality boost.** The only saved engram eval in the chimere repo (`benchmarks/engram_trained_eval.json`) was run on GPT-2 + wikitext-2 and shows a −13.39% PPL regression on that out-of-distribution setup. No Qwen3.5-specific perplexity eval has been published yet. Engram ships as an optional per-domain n-gram bias (kine, code, cyber, general); qualitative use shows specialized vocabulary in responses (*drainage bronchique postural*, *EMII*, ...) on the kiné domain, but no quantitative claim is attached to it today.
- **Multi-slot concurrent decoding via `ik_llama.cpp` is broken under heavy load** (ik_llama multi-slot bug: slot 0 contamination of system prompts under contention). The `chimere-server` production deployment is single-slot. Stock `llama-server` does NOT have this bug if you need parallel slots.
- **Tool-calling sampler defaults.** `presence_penalty` defaults to `0.0` -- a previous default of `1.5` killed code generation and long reasoning blocks. See the chimere-server source.
## Files

| File | Size | Description |
|---|---|---|
| `chimere-v3-ramp.gguf` | 15 GB | v3 RAMP GGUF (instructions + reasoning focus) |
| `imatrix.dat` | 184 MB | Importance matrix used for quantization |
## Related
- chimere -- Official Rust runtime (chimere-server) with Engram, MTP, multi-agent, multi-arch dispatch
- ik_llama.cpp fork -- Backend with Mamba-2 + Nemotron-H backport (PR #1593)
- Chimere v1 GGUF -- Best code + tools
- BF16 full weights -- For re-quantization or fine-tuning
- LoRA adapter -- For further training
- Chimere ODO -- A-LoRA intent routing
## Citation

```bibtex
@misc{chimere-v3-2026,
  title={Chimere v3: Claude Opus 4.6 Distillation of Qwen3.5-35B-A3B MoE for Instructions and Reasoning},
  author={Kevletesteur},
  year={2026},
  url={https://huggingface.co/Kevletesteur/Qwen3.5-35B-A3B-Chimere-v3-GGUF}
}
```