# Qwen3.5-35B-A3B Chimere v3 -- RAMP GGUF

Chimere v3 is a Claude Opus 4.6 distillation of Qwen3.5-35B-A3B, optimized for instruction following and reasoning.

RAMP quantization (per-tensor quality overrides + imatrix): 15 GB, fits in 16 GB VRAM, ~80 tok/s on an RTX 5060 Ti.

Looking for v1 (best for code + tools)? See Chimere v1 GGUF.

## Compatible runtimes

This GGUF can be loaded by any runtime that supports the Qwen3.5-35B-A3B (qwen35moe) architecture. The reference runtime — and the one that exercises all chimere-specific features (Engram n-gram bias, multi-agent context switching, the C++ fast sampler with DRY + min-p, K-cache Hadamard rotation, fused MoE up/gate) — is chimere-server.

| Runtime | Engram | Multi-agent | DRY sampler | K-cache Hadamard | Notes |
|---|---|---|---|---|---|
| chimere-server (Rust, official) | yes | yes | yes (C++ fast path) | yes | Production target. Also runs Mamba-2 / Nemotron-H MoE through the same backend (PR ikawrakow/ik_llama.cpp#1593). |
| ik_llama.cpp llama-server | no | no | optional | optional | Same backend that chimere-server links against, just without the Rust HTTP/sampling layer. |
| llama.cpp stock llama-server | no | no | no | no | Works, but slower on Qwen3.5 MoE on our hardware (no iqk matmul, no fused MoE up/gate). |

## Benchmark Results

### v3 strengths: instructions and reasoning

| Benchmark | v3 RAMP (this repo) | v1 RAMP | Base Qwen3.5-35B-A3B | Notes |
|---|---|---|---|---|
| IFEval (15 instruction tests) | 100% | 67% | ~91.9% | +33 pts vs v1 |
| Edge cases (15 adversarial tests) | 100% | 87% | -- | Perfect prompt-injection resistance |
| GSM8K CoT 8-shot (1,319 qs) | 84.0% | 52.2% | -- | +32 pts vs v1 |
| HumanEval (30 problems, executed) | 83% | 97% | -- | v1 better here |
| BFCL tool-calling (20 questions) | 75% | 90% | 67.3% | v1 better here |
| Speed (RTX 5060 Ti 16 GB, chimere-server) | ~80 tok/s | ~80 tok/s | -- | NCMOE=3, ctx 64K |

### Qualitative agentic tests

| Scenario | v3 | v1 | Max |
|---|---|---|---|
| Cybersecurity incident response (multi-tool chain) | 4 | 4 | 10 |
| ML pipeline architecture (RAG, 10K users, $50K budget) | 8 | 8 | 10 |
| Rust MoE runtime optimization (async prefetch, CUDA) | 8 | 7 | 10 |
| Total | 20 | 19 | 30 |

### Honest assessment

- Strengths: 100% IFEval, 100% adversarial edge cases, 84% GSM8K, best overall reasoning
- Weaknesses: code generation slightly weaker (83% vs 97% HumanEval), tool-calling lower (75% vs 90% BFCL)
- Why: the v3 dataset added IFEval-strict, OPSDC-compressed reasoning, and instruction-following samples on top of the v1 base. Recommended for general agentic use.

## Which version to use?

| Use case | Recommended | Why |
|---|---|---|
| Instruction following, formatting | v3 (this repo) | 100% IFEval, 100% edge cases |
| Math reasoning | v3 (this repo) | 84% GSM8K (vs 52% for v1) |
| Prompt injection resistance | v3 (this repo) | 100% adversarial edge cases |
| Code generation, debugging | v1 | 97% HumanEval |
| Tool-calling, function calling | v1 | 90% BFCL |
| Re-quantization or fine-tuning | BF16 weights | Full precision |

Best of both worlds: use A-LoRA routing, where an intent classifier selects the appropriate LoRA at runtime. Code/tools queries route to v1; instruction/reasoning queries route to v3. See Chimere ODO.
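The routing idea above can be sketched as a toy keyword classifier. This is purely illustrative: the real Chimere ODO router uses a trained intent classifier, and `route_adapter` plus the keyword list below are invented for this sketch.

```python
# Hypothetical sketch of A-LoRA-style routing. The real router is a trained
# intent classifier; this keyword version only illustrates the dispatch idea.

CODE_HINTS = ("def ", "fn ", "class ", "compile", "traceback",
              "tool", "debug", "refactor", "function call")

def route_adapter(prompt: str) -> str:
    """Return which LoRA to apply: 'v1' for code/tool intents, else 'v3'."""
    lowered = prompt.lower()
    if any(hint in lowered for hint in CODE_HINTS):
        return "v1"   # stronger on HumanEval / BFCL
    return "v3"       # stronger on IFEval / GSM8K

print(route_adapter("Please debug this Rust traceback"))          # v1
print(route_adapter("Summarize the meeting in 5 bullet points"))  # v3
```

A production router would classify on embeddings rather than substrings, but the dispatch contract (one adapter name per request) is the same.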

## Quick start (chimere-server, recommended)

```bash
# 1. Backend (one-time): build the ik_llama.cpp fork with sm_120 CUDA + Mamba-2 backport
git clone https://github.com/AIdevsmartdata/ik_llama.cpp.git ~/ik_llama.cpp
cd ~/ik_llama.cpp
git checkout mamba2-nemotron-h-backport
cmake -B build_sm120 -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=120 -DGGML_NATIVE=OFF
cmake --build build_sm120 -j

# 2. Server
git clone https://github.com/AIdevsmartdata/chimere.git
cd chimere/chimere-server
LD_LIBRARY_PATH=$HOME/ik_llama.cpp/build_sm120/ggml/src:$HOME/ik_llama.cpp/build_sm120/src:/usr/local/cuda-12.8/lib64 \
  cargo build --release --features server --bin chimere-server

# 3. Model + tokenizer
mkdir -p ~/models && cd ~/models
hf download Kevletesteur/Qwen3.5-35B-A3B-Chimere-v3-GGUF chimere-v3-ramp.gguf
hf download Qwen/Qwen3.5-35B-A3B tokenizer.json --local-dir tokenizers/qwen35

# 4. Run (production env vars)
CHIMERE_MODEL=$PWD/chimere-v3-ramp.gguf \
CHIMERE_TOKENIZER=$PWD/tokenizers/qwen35/tokenizer.json \
CHIMERE_LLAMA_BACKEND=1 \
CHIMERE_NCMOE=3 \
CHIMERE_KV_MAX_SEQ=65536 \
CHIMERE_PORT=8081 \
CHIMERE_FORCE_QWEN35=1 \
LD_LIBRARY_PATH=$HOME/ik_llama.cpp/build_sm120/ggml/src:$HOME/ik_llama.cpp/build_sm120/src:/usr/local/cuda-12.8/lib64 \
~/chimere/chimere-server/target/release/chimere-server

# 5. Hello world
curl -s http://localhost:8081/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"messages":[{"role":"user","content":"Hello"}],"max_tokens":64}'
```

## Engram (optional, prod-only)

Chimere ships an n-gram logit bias overlay loaded from binary .engr tables. To enable it, set:

```bash
CHIMERE_ENGRAM_DIR=/path/to/engram_tables   # directory of *.engr files
CHIMERE_ENGRAM_ALPHA=0.1                    # logit bias strength
```

The engram tables are tokenizer-specific (Qwen3.5 vocab) and act as a per-domain overlay (kine, code, cyber, general). They are intended as a domain-knowledge injector, not a measured quality booster; see the chimere repo README for the honest status of this feature.
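Conceptually, the overlay adds an alpha-scaled bias to the logits of tokens predicted by matching n-grams. The sketch below is invented for illustration: the real `.engr` tables are binary and keyed on Qwen3.5 token ids, whereas this version uses a plain dict mapping a bigram context to per-token biases.

```python
# Illustrative-only sketch of an n-gram logit-bias overlay in the spirit of
# Engram. `apply_engram_bias` and the dict-based table format are invented.

def apply_engram_bias(logits, context, table, alpha=0.1):
    """Add alpha-scaled n-gram biases to `logits` in place and return them."""
    key = tuple(context[-2:])              # last two tokens = trigram context
    for token_id, bias in table.get(key, {}).items():
        logits[token_id] += alpha * bias   # alpha plays the CHIMERE_ENGRAM_ALPHA role
    return logits

# After seeing tokens [5, 9], boost token 42 by 3.0 (scaled by alpha).
table = {(5, 9): {42: 3.0}}
logits = [0.0] * 50
apply_engram_bias(logits, [1, 5, 9], table, alpha=0.1)
print(round(logits[42], 3))  # 0.3
```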

## Quick start (generic GGUF runtimes)

If you do not need the chimere stack, the GGUF works with any Qwen3.5-compatible runtime:

```bash
# llama.cpp / llama-server
llama-server \
    -m chimere-v3-ramp.gguf \
    -ngl 99 --n-cpu-moe 4 -c 32768 \
    --flash-attn on --jinja --port 8081

# For 16 GB VRAM (RTX 5060 Ti / RTX 4080), add KV cache quantization to save VRAM:
# -ctk q8_0 -ctv q4_0
```

## Recommended sampling parameters

| Mode | temp | top_p | top_k | presence_penalty |
|---|---|---|---|---|
| Thinking (default) | 1.0 | 0.95 | 20 | 0.0 |
| Thinking + code/tools | 0.6 | 0.95 | 20 | 0.0 |
| No-think | 0.7 | 0.8 | 20 | 0.0 |
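For convenience, the table above can be captured as request presets for the OpenAI-compatible endpoint. The preset names and the `build_payload` helper are ours; only the numbers come from the table.

```python
# Sampling presets mirroring the recommended-parameters table above.
# Preset keys ("thinking", "thinking_code", "no_think") are invented labels.

PRESETS = {
    "thinking":      {"temperature": 1.0, "top_p": 0.95, "top_k": 20, "presence_penalty": 0.0},
    "thinking_code": {"temperature": 0.6, "top_p": 0.95, "top_k": 20, "presence_penalty": 0.0},
    "no_think":      {"temperature": 0.7, "top_p": 0.8,  "top_k": 20, "presence_penalty": 0.0},
}

def build_payload(messages, mode="thinking", max_tokens=512):
    """Assemble a /v1/chat/completions body with the chosen preset merged in."""
    return {"messages": messages, "max_tokens": max_tokens, **PRESETS[mode]}

p = build_payload([{"role": "user", "content": "Hello"}], mode="no_think")
print(p["temperature"], p["top_p"])  # 0.7 0.8
```

POST the resulting dict as JSON to the server started in the quick start above.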

## Backend

The official chimere-server runtime links against a customized ik_llama.cpp fork (branch mamba2-nemotron-h-backport, head of upstream PR ikawrakow/ik_llama.cpp#1593).

Highlights of the chimere-specific layer on top of ik_llama:

- Custom C++ fast sampler exporting `sample_token_fast`, `set_logit_bias`, `set_engram_bias`, `clear_engram_bias`, and `take_packed_logprobs`; avoids a ~993 KB logits copy per token and packs OpenAI-format top-5 logprobs.
- K-cache Hadamard rotation, fused MoE up/gate, and grouped expert routing, all enabled by default via `cparams`.
- Multi-agent KV / SSM state save and restore via `llama_state_seq_*`, keyed on the OpenAI `user` field. Up to `CHIMERE_MAX_AGENTS` (default 4) concurrent personas, each with its own conversation state.
- An OpenAI-compatible HTTP layer in Rust (axum 0.8) supporting non-streaming and SSE streaming, tool calls, `<think>` reasoning extraction, and `chat_template_kwargs.enable_thinking`.
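The min-p half of the DRY + min-p fast path can be illustrated with a minimal filter: keep only tokens whose probability is at least `min_p` times the highest probability. This is a toy sketch; the production sampler is C++ and works on raw logits, while this version works on already-normalized probabilities.

```python
# Minimal sketch of min-p filtering (not the chimere-server implementation).

def min_p_filter(probs, min_p=0.05):
    """Keep tokens whose probability >= min_p * the maximum probability."""
    threshold = min_p * max(probs.values())
    return {tok: p for tok, p in probs.items() if p >= threshold}

probs = {"a": 0.6, "b": 0.3, "c": 0.02, "d": 0.005}
print(sorted(min_p_filter(probs)))  # ['a', 'b']
```

Unlike top-k, the cutoff adapts to the model's confidence: a flat distribution keeps many candidates, a peaked one keeps few.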

## Multi-architecture support

The chimere-server runtime is no longer Qwen-only. As of Step 7 (April 2026), it dispatches between two code paths based on the GGUF's `general.architecture` metadata:

- Qwen3.5-35B-A3B (`qwen35moe`): the full production stack (MTP, MRoPE, Engram, agent scheduler, custom Candle / cudarc / libllama paths). This GGUF.
- Mamba-2 / Nemotron-H MoE / Mamba-1 / Mamba-2 hybrids: libllama-only path via `GenericModel`. No MTP, no Engram, single-agent only at Step 7. Validated end-to-end on unsloth/Nemotron-3-Nano-30B-A3B-GGUF (Q4_0 and UD-IQ3_XXS) at ~45 tok/s on an RTX 5060 Ti (NCMOE=30, ctx 2048), via the bundled test-nemotron smoke binary.

Models that should run via the same Generic path (untested at the chimere level — your mileage may vary): Granite 4.0 H-Tiny / H-Small / H-Micro, Falcon-H1 0.5B – 34B, Bamba-9B v1 / v2, state-spaces/mamba2-*, mistralai/Mamba-Codestral-7B-v0.1, AI21-Jamba-Reasoning-3B.

## RAMP Quantization Details

Custom per-tensor quality overrides -- critical paths get higher precision. Overall: ~3.78 BPW.

| Tensor | Quant | BPW | Rationale |
|---|---|---|---|
| attn_v (value) | Q8_0 | 8.0 | Most critical; errors cause hallucinations |
| ssm_alpha, ssm_d | Q8_0 | 8.0 | GDN recurrent params, tiny but hypersensitive |
| attn_k (key) | Q6_K | 6.5 | Important for attention routing |
| ssm_dt | Q6_K | 6.5 | GDN timestep |
| token_embd, output | Q6_K | 6.5 | Shared embeddings |
| attn_q, attn_output | Q5_K | 5.5 | More tolerant |
| ssm_in, ssm_out | Q5_K | 5.5 | SSM projections |
| 256 MoE experts (FFN) | IQ3_S | 3.44 | 80% of params, high MoE redundancy |

- imatrix: generated on the BF16 model (B200, 192 GB VRAM), 200 calibration chunks
- Result: 15 GB with no measured quality loss on our agentic benchmarks vs BF16
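The override table above can be expressed as a first-match rule list. This is a hypothetical re-creation for illustration: the actual quantization used per-tensor overrides plus the published imatrix, and the pattern strings below simply mirror the table.

```python
# Hypothetical first-match mapping from GGUF tensor names to the RAMP
# quant types listed above. Rule order matters for overlapping substrings.

RULES = [
    ("attn_v",      "Q8_0"),   # value projection: most error-sensitive
    ("ssm_alpha",   "Q8_0"),
    ("ssm_dt",      "Q6_K"),   # must precede the broader "ssm_d" pattern
    ("ssm_d",       "Q8_0"),
    ("attn_k",      "Q6_K"),
    ("token_embd",  "Q6_K"),
    ("attn_output", "Q5_K"),   # must precede the broader "output" pattern
    ("output",      "Q6_K"),
    ("attn_q",      "Q5_K"),
    ("ssm_in",      "Q5_K"),
    ("ssm_out",     "Q5_K"),
]

def quant_for(tensor_name: str) -> str:
    """Return the quant type for a tensor name, defaulting to the expert FFN quant."""
    for pattern, quant in RULES:
        if pattern in tensor_name:
            return quant
    return "IQ3_S"  # default: the 256 MoE expert FFN tensors

print(quant_for("blk.0.attn_v.weight"))       # Q8_0
print(quant_for("blk.7.ffn_up_exps.weight"))  # IQ3_S
```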

## Training Details

| Parameter | Value |
|---|---|
| Base model | Qwen/Qwen3.5-35B-A3B (MoE, 256 experts) |
| Method | SFT (BF16 LoRA, r=64), completion-only loss |
| Dataset | 10,191 samples (v1 base + 428 additional: IFEval strict, OPSDC reasoning, instruction following) |
| Epochs | 1 (160 steps, batch 64) |
| Training GPU | NVIDIA B200 |
| Training cost | ~$2 |

### v3 dataset additions (on top of v1 base)

- +50 IFEval strict (5 constraint categories)
- +30 strict code (no markdown)
- +30 code gen with thinking
- +30 instruction following
- +20 OPSDC-compressed reasoning (-64% tokens)
- +15 multi-turn agentic

## Limitations

- MTP infrastructure present, but gated. This GGUF carries an MTP (multi-token prediction) head; chimere-server detects it via `n_nextn_layer = 1` and exposes the speculative-decoding infrastructure (`mtp_scheduler.rs`, `MtpOp` FFI). An early March bench on a previous build measured a +49.5% token acceptance rate for the MTP draft path; that figure is not currently reproducible because `bench_mtp.rs:104-167` has Benchmarks 2 and 5 hard-coded as SKIPPED with the comment "crash in ik_llama MTP graph, KV cache issue for layer 41". Until that fix lands, the ~80 tok/s figure above is the non-MTP path. We will re-publish the MTP gain once the bench passes.
- Engram is a domain-knowledge overlay, not a measured quality boost. The only saved engram eval in the chimere repo (`benchmarks/engram_trained_eval.json`) was run on GPT-2 + wikitext-2 and shows a 13.39% perplexity regression on that out-of-distribution setup. No Qwen3.5-specific perplexity eval has been published yet. Engram ships as an optional per-domain n-gram bias (kine, code, cyber, general); qualitative use shows specialized vocabulary in responses ("drainage bronchique postural", "EMII", ...) on the kine (physiotherapy) domain, but no quantitative claim is attached to it today.
- Multi-slot concurrent decoding via ik_llama.cpp is broken under heavy load (ik_llama multi-slot bug: slot 0 contamination of system prompts under contention). The chimere-server production deployment is single-slot. Stock llama-server does NOT have this bug if you need parallel slots.
- Tool-calling sampler defaults: presence_penalty defaults to 0.0; a previous default of 1.5 killed code generation and long reasoning blocks. See the chimere-server source.

## Files

| File | Size | Description |
|---|---|---|
| chimere-v3-ramp.gguf | 15 GB | v3 RAMP GGUF (instructions + reasoning focus) |
| imatrix.dat | 184 MB | Importance matrix used for quantization |


## Citation

```bibtex
@misc{chimere-v3-2026,
  title={Chimere v3: Claude Opus 4.6 Distillation of Qwen3.5-35B-A3B MoE for Instructions and Reasoning},
  author={Kevletesteur},
  year={2026},
  url={https://huggingface.co/Kevletesteur/Qwen3.5-35B-A3B-Chimere-v3-GGUF}
}
```