# Qwen3.5-35B-A3B Chimere v3 -- RAMP GGUF
Chimere v3: Claude Opus 4.6 distillation of Qwen3.5-35B-A3B, optimized for instruction following and reasoning.
RAMP quantization (per-tensor quality overrides + imatrix) -- 15 GB, fits 16 GB VRAM, ~80 tok/s on RTX 5060 Ti.
Looking for v1 (best code + tools)? See Chimere v1 GGUF.
## Compatible runtimes
This GGUF can be loaded by any runtime that supports the Qwen3.5-35B-A3B (qwen35moe) architecture. The reference runtime — and the one that exercises all chimere-specific features (Engram n-gram bias, multi-agent context switching, the C++ fast sampler with DRY + min-p, K-cache Hadamard rotation, fused MoE up/gate) — is chimere-server.
| Runtime | Engram | Multi-agent | DRY sampler | K-cache Hadamard | Notes |
|---|---|---|---|---|---|
| chimere-server (Rust, official) | yes | yes | yes (C++ fast path) | yes | Production target. Also runs Mamba-2 / Nemotron-H MoE through the same backend (PR ikawrakow/ik_llama.cpp#1593). |
| ik_llama.cpp llama-server | no | no | optional | optional | Same backend that chimere-server links against, just without the Rust HTTP/sampling layer. |
| llama.cpp stock llama-server | no | no | no | no | Works, but slower on Qwen3.5 MoE on our hardware (no iqk matmul, no fused MoE up/gate). |
## Benchmark Results

### v3 strengths: instructions and reasoning
| Benchmark | v3 RAMP (this repo) | v1 RAMP | Base Qwen3.5-35B-A3B | Notes |
|---|---|---|---|---|
| IFEval (15 instruction tests) | 100% | 67% | ~91.9% | +33 pts vs v1 |
| Edge cases (15 adversarial tests) | 100% | 87% | -- | Perfect prompt injection resistance |
| GSM8K CoT 8-shot (1,319 questions) | 84.0% | 52.2% | -- | +32 pts vs v1 |
| HumanEval (30 problems, executed) | 83% | 97% | -- | v1 better here |
| BFCL tool-calling (20 questions) | 75% | 90% | 67.3% | v1 better here |
| Speed (RTX 5060 Ti 16 GB, chimere-server) | ~80 tok/s | ~80 tok/s | -- | NCMOE=3, ctx 64K |
### Qualitative agentic tests
| Scenario | v3 | v1 | /10 |
|---|---|---|---|
| Cybersecurity incident response (multi-tool chain) | 4 | 4 | 10 |
| ML pipeline architecture (RAG, 10K users, $50K budget) | 8 | 8 | 10 |
| Rust MoE runtime optimization (async prefetch, CUDA) | 8 | 7 | 10 |
| Total | 20 | 19 | 30 |
### Honest assessment
- Strengths: 100% IFEval, 100% adversarial edge cases, 84% GSM8K, best overall reasoning
- Weaknesses: Code generation slightly weaker (83% vs 97%), tool-calling lower (75% vs 90%)
- Why: v3 dataset added IFEval-strict, OPSDC-compressed reasoning, and instruction-following samples on top of v1 base. Recommended for general agentic use.
## Which version to use?
| Use case | Recommended | Why |
|---|---|---|
| Instruction following, formatting | v3 (this repo) | 100% IFEval, 100% edge cases |
| Math reasoning | v3 (this repo) | 84% GSM8K (vs 52% v1) |
| Prompt injection resistance | v3 (this repo) | 100% adversarial edge cases |
| Code generation, debugging | v1 | 97% HumanEval |
| Tool-calling, function calling | v1 | 90% BFCL |
| Re-quantization or fine-tuning | BF16 weights | Full precision |
Best of both worlds: Use A-LoRA routing -- an intent classifier selects the appropriate LoRA at runtime. Code/tools queries use v1, instruction/reasoning queries use v3. See Chimere ODO.
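The routing idea can be sketched in a few lines. The real Chimere ODO intent classifier is not reproduced here; this keyword heuristic is a hypothetical stand-in that only illustrates the decision (v1 for code/tools queries, v3 for everything else):

```python
# Hypothetical sketch of A-LoRA-style intent routing. The real Chimere ODO
# classifier is not public in this card; this keyword heuristic is a
# stand-in illustrating the v1-vs-v3 routing decision.
CODE_TOOL_HINTS = (
    "def ", "class ", "debug", "compile", "stack trace",
    "regex", "sql", "tool_call", "json schema",
)

def route_adapter(user_query: str) -> str:
    """Return which checkpoint/LoRA to serve the query with."""
    q = user_query.lower()
    if any(hint in q for hint in CODE_TOOL_HINTS):
        return "chimere-v1"  # best code + tools (97% HumanEval, 90% BFCL)
    return "chimere-v3"      # best instructions + reasoning (100% IFEval, 84% GSM8K)

print(route_adapter("Debug this Python function"))           # chimere-v1
print(route_adapter("Summarize the report in 5 bullets"))    # chimere-v3
```

A production router would use a trained classifier rather than substring matching, but the interface is the same: query in, adapter name out.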
## Quick start (chimere-server, recommended)
```bash
# 1. Backend (one-time): build the ik_llama.cpp fork with sm_120 CUDA + Mamba-2 backport
git clone https://github.com/AIdevsmartdata/ik_llama.cpp.git ~/ik_llama.cpp
cd ~/ik_llama.cpp
git checkout mamba2-nemotron-h-backport
cmake -B build_sm120 -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=120 -DGGML_NATIVE=OFF
cmake --build build_sm120 -j

# 2. Server
git clone https://github.com/AIdevsmartdata/chimere.git
cd chimere/chimere-server
LD_LIBRARY_PATH=$HOME/ik_llama.cpp/build_sm120/ggml/src:$HOME/ik_llama.cpp/build_sm120/src:/usr/local/cuda-12.8/lib64 \
cargo build --release --features server --bin chimere-server

# 3. Model + tokenizer
mkdir -p ~/models && cd ~/models
hf download Kevletesteur/Qwen3.5-35B-A3B-Chimere-v3-GGUF chimere-v3-ramp.gguf
hf download Qwen/Qwen3.5-35B-A3B tokenizer.json --local-dir tokenizers/qwen35

# 4. Run (production env vars)
CHIMERE_MODEL=$PWD/chimere-v3-ramp.gguf \
CHIMERE_TOKENIZER=$PWD/tokenizers/qwen35/tokenizer.json \
CHIMERE_LLAMA_BACKEND=1 \
CHIMERE_NCMOE=3 \
CHIMERE_KV_MAX_SEQ=65536 \
CHIMERE_PORT=8081 \
CHIMERE_FORCE_QWEN35=1 \
LD_LIBRARY_PATH=$HOME/ik_llama.cpp/build_sm120/ggml/src:$HOME/ik_llama.cpp/build_sm120/src:/usr/local/cuda-12.8/lib64 \
~/chimere/chimere-server/target/release/chimere-server

# 5. Hello world
curl -s http://localhost:8081/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"messages":[{"role":"user","content":"Hello"}],"max_tokens":64}'
```
## Engram (optional, prod-only)
Chimere ships an n-gram logit bias overlay loaded from binary .engr tables. To enable it, set:
```bash
CHIMERE_ENGRAM_DIR=/path/to/engram_tables   # directory of *.engr files
CHIMERE_ENGRAM_ALPHA=0.1                    # logit bias strength
```
The engram tables are tokenizer-specific (Qwen3.5 vocab) and applied as a per-domain overlay (kine, code, cyber, general). They are intended as a domain-knowledge injector, not a measured quality booster -- see the chimere repo README for the honest status of this feature.
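The mechanism behind an n-gram logit-bias overlay can be sketched in pure Python. The on-disk `.engr` format and the real lookup are chimere-internal and not documented in this card; the table layout, the `log1p(count)` scaling, and the token IDs below are illustrative assumptions -- only the alpha knob (`CHIMERE_ENGRAM_ALPHA`) comes from the text above:

```python
import math

# Toy bigram "engram" table: prev_token_id -> {next_token_id: count}.
# The real .engr binary format is chimere-internal; this only illustrates
# the idea of biasing logits toward domain-frequent continuations.
ENGRAM = {
    101: {7: 12, 9: 3},  # after token 101, tokens 7 and 9 are domain-frequent
}

def apply_engram_bias(logits: dict[int, float], prev_token: int,
                      alpha: float = 0.1) -> dict[int, float]:
    """Add alpha * log1p(count) to logits of continuations in the table.
    The log1p scaling is an assumption, not the documented formula."""
    biased = dict(logits)
    for tok, count in ENGRAM.get(prev_token, {}).items():
        if tok in biased:
            biased[tok] += alpha * math.log1p(count)
    return biased

logits = {7: 0.0, 8: 0.5, 9: 0.0}
out = apply_engram_bias(logits, prev_token=101, alpha=0.1)
# token 7 gains 0.1 * ln(13); token 8 is untouched
```

Because the bias is additive in logit space and scaled by alpha, setting `CHIMERE_ENGRAM_ALPHA=0` recovers the unbiased model exactly.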
## Quick start (generic GGUF runtimes)
If you do not need the chimere stack, the GGUF works with any Qwen3.5-compatible runtime:
```bash
# llama.cpp / llama-server
llama-server \
  -m chimere-v3-ramp.gguf \
  -ngl 99 --n-cpu-moe 4 -c 32768 \
  --flash-attn on --jinja --port 8081

# For 16 GB VRAM (RTX 5060 Ti / RTX 4080), add KV cache
# quantization to save VRAM:
#   -ctk q8_0 -ctv q4_0
```
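Rough arithmetic on why `-ctk q8_0 -ctv q4_0` helps, using ggml's block layouts (q8_0 stores 32 values in 34 bytes, q4_0 in 18 bytes, vs 2 bytes per value for f16). The absolute KV footprint also depends on the model's layer and head dimensions, which this card does not restate, so only the ratio is computed:

```python
# Back-of-envelope VRAM ratio for a quantized KV cache
# (-ctk q8_0 -ctv q4_0) relative to the default f16 K/V.
# Bytes per element from ggml block layouts:
F16  = 2.0        # 2 bytes per value
Q8_0 = 34 / 32    # 34-byte block per 32 values -> 1.0625 B/value
Q4_0 = 18 / 32    # 18-byte block per 32 values -> 0.5625 B/value

f16_kv   = F16 + F16      # K + V per cached element
mixed_kv = Q8_0 + Q4_0    # q8_0 K + q4_0 V

print(f"mixed KV is {mixed_kv / f16_kv:.1%} of f16 KV")
```

So the mixed setting keeps roughly 40% of the f16 KV-cache footprint, freeing VRAM for a larger context or more GPU-resident experts.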
## Recommended sampling parameters
| Mode | temp | top_p | top_k | presence_penalty |
|---|---|---|---|---|
| Thinking (default) | 1.0 | 0.95 | 20 | 0.0 |
| Thinking + code/tools | 0.6 | 0.95 | 20 | 0.0 |
| No-think | 0.7 | 0.8 | 20 | 0.0 |
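The knobs in the table compose as in most llama.cpp-style sampling chains. A minimal pure-Python sketch (this is not the chimere C++ fast sampler; the temperature -> top-k -> top-p ordering mirrors common llama.cpp behavior and is an assumption here):

```python
import math, random

def sample(logits: list[float], temp: float = 0.6, top_k: int = 20,
           top_p: float = 0.95, rng=random.random) -> int:
    """Minimal temperature -> top-k -> top-p (nucleus) sampler sketch."""
    # Temperature: flatten or sharpen the distribution.
    scaled = [l / temp for l in logits]
    # Top-k: keep only the k highest-logit tokens.
    order = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)[:top_k]
    # Softmax over the survivors (numerically stable).
    m = max(scaled[i] for i in order)
    exps = [(i, math.exp(scaled[i] - m)) for i in order]
    z = sum(e for _, e in exps)
    probs = [(i, e / z) for i, e in exps]
    # Top-p: smallest prefix (by descending prob) whose mass reaches top_p.
    kept, acc = [], 0.0
    for i, p in probs:
        kept.append((i, p)); acc += p
        if acc >= top_p:
            break
    # Renormalize and draw.
    z2 = sum(p for _, p in kept)
    r = rng() * z2
    for i, p in kept:
        r -= p
        if r <= 0:
            return i
    return kept[-1][0]
```

With the "Thinking + code/tools" row (temp 0.6, top_k 20, top_p 0.95) a confident model collapses toward greedy decoding, while the default thinking row (temp 1.0) keeps more of the tail.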
## Backend
The official chimere-server runtime links against a customized ik_llama.cpp fork (branch mamba2-nemotron-h-backport, head of upstream PR ikawrakow/ik_llama.cpp#1593).
Highlights of the chimere-specific layer on top of ik_llama:
- Custom C++ fast sampler exporting `sample_token_fast`, `set_logit_bias`, `set_engram_bias`, `clear_engram_bias` and `take_packed_logprobs` -- avoids a ~993 KB logits copy per token and packs OpenAI-format top-5 logprobs.
- K-cache Hadamard rotation, fused MoE up/gate, grouped expert routing -- all enabled by default via `cparams`.
- Multi-agent KV / SSM state save & restore via `llama_state_seq_*`, keyed on the OpenAI `user` field. Up to `CHIMERE_MAX_AGENTS` (default 4) concurrent personas, each with its own conversation state.
- An OpenAI-compatible HTTP layer in Rust (axum 0.8) supporting non-streaming and SSE streaming, tool calls, `<think>` reasoning extraction, and `chat_template_kwargs.enable_thinking`.
## Multi-architecture support

The chimere-server runtime is no longer Qwen-only. As of Step 7 (April 2026), it dispatches between two code paths based on the GGUF's `general.architecture` metadata:

- Qwen3.5-35B-A3B (`qwen35moe`) -- full production stack: MTP, MRoPE, Engram, agent scheduler, custom Candle / cudarc / libllama paths. This GGUF.
- Mamba-2 / Nemotron-H MoE / Mamba-1 / Mamba-2 hybrids -- libllama-only path via `GenericModel`. No MTP, no Engram, single-agent only at Step 7. Validated end-to-end on `unsloth/Nemotron-3-Nano-30B-A3B-GGUF` (Q4_0 and UD-IQ3_XXS) at ~45 tok/s on an RTX 5060 Ti (NCMOE=30, ctx 2048) via the bundled `test-nemotron` smoke binary.
Models that should run via the same Generic path (untested at the chimere level — your mileage may vary): Granite 4.0 H-Tiny / H-Small / H-Micro, Falcon-H1 0.5B – 34B, Bamba-9B v1 / v2, state-spaces/mamba2-*, mistralai/Mamba-Codestral-7B-v0.1, AI21-Jamba-Reasoning-3B.
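The dispatch reduces to a lookup on the architecture string. Only `qwen35moe` is stated in this card; the generic-path architecture names below are assumptions for illustration -- check the `general.architecture` key in your GGUF's metadata:

```python
# Sketch of the Step 7 dispatch on the GGUF `general.architecture` key.
# `qwen35moe` is the arch string given in this card; the names in
# GENERIC_ARCHS are illustrative assumptions, not a verified list.
GENERIC_ARCHS = {"mamba", "mamba2", "nemotron_h", "jamba", "falcon_h1", "granitehybrid"}

def select_path(architecture: str) -> str:
    """Route a loaded GGUF to the full stack or the GenericModel path."""
    if architecture == "qwen35moe":
        return "full"     # MTP, MRoPE, Engram, agent scheduler
    if architecture in GENERIC_ARCHS:
        return "generic"  # libllama-only GenericModel path
    raise ValueError(f"unsupported architecture: {architecture}")
```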
## RAMP Quantization Details
Custom per-tensor quality overrides -- critical paths get higher precision. Overall: ~3.78 BPW.
| Tensor | Quant | BPW | Rationale |
|---|---|---|---|
| attn_v (value) | Q8_0 | 8.0 | Most critical -- errors cause hallucinations |
| ssm_alpha, ssm_d | Q8_0 | 8.0 | GDN recurrent params, tiny but hypersensitive |
| attn_k (key) | Q6_K | 6.5 | Important for attention routing |
| ssm_dt | Q6_K | 6.5 | GDN timestep |
| token_embd, output | Q6_K | 6.5 | Shared embeddings |
| attn_q, attn_output | Q5_K | 5.5 | More tolerant |
| ssm_in, ssm_out | Q5_K | 5.5 | SSM projections |
| 256 MoE experts (FFN) | IQ3_S | 3.44 | 80% of params, high MoE redundancy |
- imatrix: Generated on BF16 model (B200, 192 GB VRAM), 200 calibration chunks
- Result: 15 GB with no measured quality loss on our agentic benchmarks vs BF16
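The stated ~3.78 BPW and the 15 GB file size are consistent, assuming ~35B total parameters from the model name (the exact parameter count may differ slightly):

```python
# Sanity check: ~3.78 bits/weight over ~35e9 total params (inferred from
# the name Qwen3.5-35B-A3B; exact count is an assumption) should land
# near the stated 15 GB file size.
params = 35e9
bpw = 3.78
size_gib = params * bpw / 8 / 2**30
print(f"{size_gib:.1f} GiB")
```

This lands at roughly 15.4 GiB, matching the repo's 15 GB figure (metadata and the imatrix-driven per-tensor overrides account for the small difference).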
## Training Details
| Parameter | Value |
|---|---|
| Base model | Qwen/Qwen3.5-35B-A3B (MoE, 256 experts) |
| Method | SFT BF16 LoRA r64, completion-only loss |
| Dataset | 10,191 samples (v1 base + 428 additional: IFEval strict, OPSDC reasoning, instruction following) |
| Epochs | 1 (160 steps, batch 64) |
| Training GPU | NVIDIA B200 |
| Training cost | ~$2 |
### v3 dataset additions (on top of v1 base)
- +50 IFEval strict (5 constraint categories)
- +30 strict code (no markdown)
- +30 code gen with thinking
- +30 instruction following
- +20 OPSDC-compressed reasoning (-64% tokens)
- +15 multi-turn agentic
## Limitations
- **MTP infrastructure present, gated.** This GGUF carries an MTP (multi-token prediction) head -- chimere-server detects it via `n_nextn_layer = 1` and exposes the speculative-decoding infrastructure (`mtp_scheduler.rs`, `MtpOpFFI`). An early March bench on a previous build measured a +49.5% token acceptance rate for the MTP draft path; that figure is not currently reproducible because `bench_mtp.rs:104-167` has Benchmarks 2 and 5 hard-coded as `SKIPPED` with the comment `crash in ik_llama MTP graph, KV cache issue for layer 41`. Until that fix lands, the 80 tok/s figure above is the non-MTP path. We will re-publish the MTP gain once the bench passes.
- **Engram is a domain-knowledge overlay, not a measured quality boost.** The only saved engram eval in the chimere repo (`benchmarks/engram_trained_eval.json`) was run on GPT-2 + wikitext-2 and shows a −13.39% PPL regression on that out-of-distribution setup. No Qwen3.5-specific perplexity eval has been published yet. Engram ships as an optional per-domain n-gram bias (kine, code, cyber, general); qualitative use shows specialized vocabulary in responses (*drainage bronchique postural*, *EMII*, ...) on the kiné domain, but no quantitative claim is attached to it today.
- **Multi-slot concurrent decoding via `ik_llama.cpp` is broken under heavy load** (ik_llama multi-slot bug: slot 0 contamination of system prompts under contention). The `chimere-server` production deployment is single-slot. Stock `llama-server` does NOT have this bug if you need parallel slots.
- **Tool-calling sampler defaults.** `presence_penalty` defaults to `0.0` -- a previous default of `1.5` killed code generation and long reasoning blocks. See the chimere-server source.
## Files

| File | Size | Description |
|---|---|---|
| `chimere-v3-ramp.gguf` | 15 GB | v3 RAMP GGUF (instructions + reasoning focus) |
| `imatrix.dat` | 184 MB | Importance matrix used for quantization |
## Related
- chimere -- Official Rust runtime (chimere-server) with Engram, MTP, multi-agent, multi-arch dispatch
- ik_llama.cpp fork -- Backend with Mamba-2 + Nemotron-H backport (PR #1593)
- Chimere v1 GGUF -- Best code + tools
- BF16 full weights -- For re-quantization or fine-tuning
- LoRA adapter -- For further training
- Chimere ODO -- A-LoRA intent routing
## Citation

```bibtex
@misc{chimere-v3-2026,
  title={Chimere v3: Claude Opus 4.6 Distillation of Qwen3.5-35B-A3B MoE for Instructions and Reasoning},
  author={Kevletesteur},
  year={2026},
  url={https://huggingface.co/Kevletesteur/Qwen3.5-35B-A3B-Chimere-v3-GGUF}
}
```