# Darwin-35B-A3B-Opus-Q8_0-GGUF

Q8_0 GGUF of Darwin-35B-A3B-Opus | ~37 GB (3 shards) | GPQA Diamond 90.0% | Near-lossless quality | MoE 35B (3B active) | 201 languages | 262K context | Apache 2.0
## About This Quantization
Q8_0 GGUF of FINAL-Bench/Darwin-35B-A3B-Opus.
| | Original (BF16) | This Model (Q8_0 GGUF) |
|---|---|---|
| Format | SafeTensors | GGUF |
| Size | 65.5 GB | ~37 GB (3 shards) |
| Quality | Baseline | Near-lossless (~99.9% of BF16) |
| VRAM Required | 65+ GB | ~37 GB |
| Runs on | H100, A100 80GB | A100 40GB, Mac 64GB, 2x RTX 4090 |
| Framework | Transformers, vLLM, SGLang | llama.cpp, Ollama, LM Studio |
## Files
| File | Size | Description |
|---|---|---|
| darwin-35b-a3b-opus-q8_0-00001-of-00003.gguf | ~13.6 GB | Shard 1 of 3 |
| darwin-35b-a3b-opus-q8_0-00002-of-00003.gguf | ~12.5 GB | Shard 2 of 3 |
| darwin-35b-a3b-opus-q8_0-00003-of-00003.gguf | ~10.7 GB | Shard 3 of 3 |
| **Total** | ~36.8 GB | All 3 shards required |
Download all 3 shard files. llama.cpp and Ollama will automatically load them together.
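One way to fetch all three shards in a single command is `huggingface-cli download` (shipped with the `huggingface_hub` package); the local directory name below is an arbitrary choice:

```shell
# Download every .gguf shard from the repo into ./darwin-opus-q8
huggingface-cli download FINAL-Bench/Darwin-35B-A3B-Opus-Q8-GGUF \
  --include "*.gguf" --local-dir ./darwin-opus-q8
```

Alternatively, the `--hf-repo`/`--hf-file` flags shown in the Usage section let llama.cpp download the shards on first run.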
## Hardware Requirements
| Setup | Memory | Status |
|---|---|---|
| NVIDIA A100 40GB | 40 GB VRAM | Fits |
| NVIDIA A100 80GB | 80 GB VRAM | Comfortable |
| NVIDIA H100 93GB | 93 GB VRAM | Comfortable |
| 2x RTX 4090 (24GB each) | 48 GB VRAM | With tensor parallel |
| Mac Studio M2/M3 Ultra 64GB | 64 GB Unified | Fits |
| Mac M3 Max 48GB | 48 GB Unified | Fits |
| Single RTX 4090 24GB | 24 GB VRAM | Insufficient (use Q4_K_M) |
As a MoE model, only 3B parameters are active per token. Inference is fast despite the 37GB model size.
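The ~37 GB figure follows directly from the Q8_0 layout: each block of 32 weights is stored as 32 int8 values plus one fp16 scale, i.e. 34 bytes per 32 weights (8.5 bits/weight). A back-of-envelope check:

```shell
# Q8_0: 32 int8 weights + 1 fp16 scale per block = 34 bytes per 32 weights
params=35000000000                    # 35B total parameters
bytes=$(( params * 34 / 32 ))
gb=$(( bytes / 1000000000 ))
echo "Estimated Q8_0 size: ${gb} GB"  # consistent with the ~36.8 GB shard total
```

The small remainder versus the actual file size comes from non-quantized tensors (embeddings, norms) and GGUF metadata.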
## Usage

### llama.cpp (CLI)

```shell
llama-cli \
  --hf-repo FINAL-Bench/Darwin-35B-A3B-Opus-Q8-GGUF \
  --hf-file darwin-35b-a3b-opus-q8_0-00001-of-00003.gguf \
  -p "The meaning to life and the universe is" \
  -n 512 -ngl 99
```
### llama.cpp (Server)

```shell
llama-server \
  --hf-repo FINAL-Bench/Darwin-35B-A3B-Opus-Q8-GGUF \
  --hf-file darwin-35b-a3b-opus-q8_0-00001-of-00003.gguf \
  -c 32768 -ngl 99
```
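Once the server is up, it exposes an OpenAI-compatible HTTP API (default port 8080). A minimal request sketch; the prompt and `max_tokens` value are illustrative:

```shell
# Query the running llama-server via its OpenAI-compatible endpoint
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "messages": [{"role": "user", "content": "Explain MoE routing in one sentence."}],
        "max_tokens": 128
      }'
```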
### Ollama

```shell
echo 'FROM ./darwin-35b-a3b-opus-q8_0-00001-of-00003.gguf' > Modelfile
ollama create darwin-opus -f Modelfile
ollama run darwin-opus
```
### LM Studio

1. Download all 3 `.gguf` shard files
2. Place them in the same folder
3. Open LM Studio and load the first shard
4. LM Studio auto-detects and loads the remaining shards
### MoE Expert Offload (Limited VRAM)

The `-ot` (override tensor placement) flag below keeps the per-expert FFN weights in system RAM while attention and shared layers stay on the GPU:

```shell
llama-cli \
  --hf-repo FINAL-Bench/Darwin-35B-A3B-Opus-Q8-GGUF \
  --hf-file darwin-35b-a3b-opus-q8_0-00001-of-00003.gguf \
  -ot ".ffn_.*_exps.=CPU" \
  -ngl 99 -c 32768
```
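To see which tensors an `-ot` pattern will catch, you can test the regex against llama.cpp-style tensor names. The names below are illustrative examples of the GGUF naming scheme, and the escaped pattern is equivalent to the one used above:

```shell
# Sample tensor names in llama.cpp's GGUF naming convention (illustrative)
names="blk.0.ffn_gate_exps.weight
blk.0.ffn_down_exps.weight
blk.0.attn_q.weight"

# Only the per-expert FFN tensors match, so only they are kept on CPU
matches=$(printf '%s\n' "$names" | grep -E '\.ffn_.*_exps\.')
echo "$matches"   # the two *_exps tensors; attn_q stays on GPU
```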
## Benchmark Results (Original Model)
Q8_0 preserves near-identical performance to BF16.
### GPQA Diamond (198 Questions, Graduate-Level Reasoning)
| Model | Accuracy |
|---|---|
| Darwin-35B-A3B-Opus | 90.0% |
| Mother (Jackrong Claude 4.6 Opus Distilled) | 85.0% |
| Father (Qwen3.5-35B-A3B Official) | 84.2% |
### MMMLU (Multilingual Knowledge, 29 Languages)
| Model | Accuracy |
|---|---|
| Darwin-35B-A3B-Opus | 85.0% |
| Father (Qwen3.5-35B-A3B Official) | 85.2% |
## How Darwin Was Created
Darwin-35B-A3B-Opus was created using Darwin V5, a diagnostic-guided evolutionary merge engine built on mergekit.
Both parent models share the same Qwen3.5-35B-A3B architecture. The Mother is a LoRA SFT on the same base, not a different architecture.
Darwin V5 adds three phases over standard mergekit evolve:
- Pre-merge parent profiling (40 layers x 256 experts: activation frequency, routing entropy, probe cosine distance)
- Evolution with diagnostic-informed initial population and constrained search space
- Post-merge child validation (layer-by-layer comparison against both parents)
Key diagnostic finding: Mother had 50-65% dead experts (activation < 5%) from text-only LoRA SFT. Darwin compensated by reducing Mother density and using Father's living experts to fill inactive slots.
Merge configuration:

```text
# Method: DARE-TIES via mergekit
L0-L37: t=0.5988 (Mother 60%) → router from Mother
L38:    t=0.9000 (Mother 90%) → reasoning core (peak probe cosine distance)
L39:    t=0.5336 (Father 47%) → router from Father (output routing)
```
For full technical details, diagnostics, and health check results, see the original model card.
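In the configuration above, `t` is the Mother interpolation weight and the Father share is `1 - t`. A quick sanity check of the percentages listed:

```shell
# Compute the Mother/Father split for a given interpolation weight t
split_for_t() {
  awk -v t="$1" 'BEGIN { printf "Mother %.0f%% / Father %.0f%%\n", t*100, (1-t)*100 }'
}

split_for_t 0.5988   # layers 0-37
split_for_t 0.9000   # layer 38
split_for_t 0.5336   # layer 39 (Father-leaning: 47%)
```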
## Other Quantizations
| Quantization | Size | Quality | Use Case |
|---|---|---|---|
| Q8_0 (this) | ~37 GB | Near-lossless | Maximum quality |
| Q4_K_M (coming soon) | ~20 GB | Good | RTX 4090, Mac 32GB |
## Model Specifications

| Specification | Value |
|---|---|
| Base Model | FINAL-Bench/Darwin-35B-A3B-Opus |
| Architecture | Qwen3.5 MoE (Gated DeltaNet + MoE) |
| Total Parameters | 35B |
| Active Parameters | 3B per forward pass |
| Experts | 256 (8 routed + 1 shared active) |
| Context Length | 262,144 native |
| Languages | 201 |
| Quantization | Q8_0 (8-bit integer) |
| GGUF Shards | 3 files |
| License | Apache 2.0 |
| Quantized by | VIDRAFT via llama.cpp |
## Acknowledgements

- Korean Government: GPU Support Program research grant
- Qwen Team: Qwen3.5-35B-A3B base architecture
- Jackrong: Claude 4.6 Opus Reasoning Distilled model
- mergekit: merge backend infrastructure
- llama.cpp: GGUF conversion and quantization
## Citation

```bibtex
@misc{vidraft_darwin_35b_opus_gguf,
  title        = {Darwin-35B-A3B-Opus-Q8_0-GGUF},
  author       = {VIDRAFT},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/FINAL-Bench/Darwin-35B-A3B-Opus-Q8-GGUF}}
}
```