# Darwin-35B-A3B-Opus-Q8_0-GGUF

Q8_0 GGUF of Darwin-35B-A3B-Opus | ~37 GB (3 shards) | GPQA Diamond 90.0% | Near-lossless quality | MoE 35B (3B active) | 201 languages | 262K context | Apache 2.0
## About This Quantization
Q8_0 GGUF of FINAL-Bench/Darwin-35B-A3B-Opus.
| | Original (BF16) | This Model (Q8_0 GGUF) |
|---|---|---|
| Format | SafeTensors | GGUF |
| Size | 65.5 GB | ~37 GB (3 shards) |
| Quality | Baseline | Near-lossless (~99.9% of BF16) |
| VRAM Required | 65+ GB | ~37 GB |
| Runs on | H100, A100 80GB | A100 40GB, Mac 64GB, 2x RTX 4090 |
| Framework | Transformers, vLLM, SGLang | llama.cpp, Ollama, LM Studio |
## Files
| File | Size | Description |
|---|---|---|
| darwin-35b-a3b-opus-q8_0-00001-of-00003.gguf | ~13.6 GB | Shard 1 of 3 |
| darwin-35b-a3b-opus-q8_0-00002-of-00003.gguf | ~12.5 GB | Shard 2 of 3 |
| darwin-35b-a3b-opus-q8_0-00003-of-00003.gguf | ~10.7 GB | Shard 3 of 3 |
| **Total** | ~36.8 GB | All 3 shards required |
Download all 3 shard files. llama.cpp and Ollama will automatically load them together.
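One way to fetch all three shards in a single command is `huggingface-cli download` (shipped with the `huggingface_hub` package); the local directory name below is an arbitrary choice:

```shell
# Download every .gguf shard from the repo into ./darwin-opus-q8
huggingface-cli download FINAL-Bench/Darwin-35B-A3B-Opus-Q8-GGUF \
  --include "*.gguf" --local-dir ./darwin-opus-q8
```

Alternatively, the `--hf-repo`/`--hf-file` flags shown in the Usage section let llama.cpp download the shards on first run.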
## Hardware Requirements
| Setup | Memory | Status |
|---|---|---|
| NVIDIA A100 40GB | 40 GB VRAM | Fits |
| NVIDIA A100 80GB | 80 GB VRAM | Comfortable |
| NVIDIA H100 93GB | 93 GB VRAM | Comfortable |
| 2x RTX 4090 (24GB each) | 48 GB VRAM | With tensor parallel |
| Mac Studio M2/M3 Ultra 64GB | 64 GB Unified | Fits |
| Mac M3 Max 48GB | 48 GB Unified | Fits |
| Single RTX 4090 24GB | 24 GB VRAM | Insufficient (use Q4_K_M) |
As a MoE model, only 3B parameters are active per token. Inference is fast despite the 37GB model size.
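The ~37 GB figure follows directly from the Q8_0 layout: each block of 32 weights is stored as 32 int8 values plus one fp16 scale, i.e. 34 bytes per 32 weights (8.5 bits/weight). A back-of-envelope check:

```shell
# Q8_0: 32 int8 weights + 1 fp16 scale per block = 34 bytes per 32 weights
params=35000000000                    # 35B total parameters
bytes=$(( params * 34 / 32 ))
gb=$(( bytes / 1000000000 ))
echo "Estimated Q8_0 size: ${gb} GB"  # consistent with the ~36.8 GB shard total
```

The small remainder versus the actual file size comes from non-quantized tensors (embeddings, norms) and GGUF metadata.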
## Usage

### llama.cpp (CLI)

```shell
llama-cli \
  --hf-repo FINAL-Bench/Darwin-35B-A3B-Opus-Q8-GGUF \
  --hf-file darwin-35b-a3b-opus-q8_0-00001-of-00003.gguf \
  -p "The meaning to life and the universe is" \
  -n 512 -ngl 99
```
### llama.cpp (Server)

```shell
llama-server \
  --hf-repo FINAL-Bench/Darwin-35B-A3B-Opus-Q8-GGUF \
  --hf-file darwin-35b-a3b-opus-q8_0-00001-of-00003.gguf \
  -c 32768 -ngl 99
```
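Once the server is up, it exposes an OpenAI-compatible HTTP API (default port 8080). A minimal request sketch; the prompt and `max_tokens` value are illustrative:

```shell
# Query the running llama-server via its OpenAI-compatible endpoint
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "messages": [{"role": "user", "content": "Explain MoE routing in one sentence."}],
        "max_tokens": 128
      }'
```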
### Ollama

```shell
echo 'FROM ./darwin-35b-a3b-opus-q8_0-00001-of-00003.gguf' > Modelfile
ollama create darwin-opus -f Modelfile
ollama run darwin-opus
```
### LM Studio

1. Download all 3 `.gguf` shard files
2. Place them in the same folder
3. Open LM Studio and load the first shard
4. LM Studio auto-detects and loads the remaining shards
### MoE Expert Offload (Limited VRAM)

The `-ot` (override tensor placement) flag below keeps the per-expert FFN weights in system RAM while attention and shared layers stay on the GPU:

```shell
llama-cli \
  --hf-repo FINAL-Bench/Darwin-35B-A3B-Opus-Q8-GGUF \
  --hf-file darwin-35b-a3b-opus-q8_0-00001-of-00003.gguf \
  -ot ".ffn_.*_exps.=CPU" \
  -ngl 99 -c 32768
```
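To see which tensors an `-ot` pattern will catch, you can test the regex against llama.cpp-style tensor names. The names below are illustrative examples of the GGUF naming scheme, and the escaped pattern is equivalent to the one used above:

```shell
# Sample tensor names in llama.cpp's GGUF naming convention (illustrative)
names="blk.0.ffn_gate_exps.weight
blk.0.ffn_down_exps.weight
blk.0.attn_q.weight"

# Only the per-expert FFN tensors match, so only they are kept on CPU
matches=$(printf '%s\n' "$names" | grep -E '\.ffn_.*_exps\.')
echo "$matches"   # the two *_exps tensors; attn_q stays on GPU
```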
## Benchmark Results (Original Model)
Q8_0 preserves near-identical performance to BF16.
### GPQA Diamond (198 Questions, Graduate-Level Reasoning)
| Model | Accuracy |
|---|---|
| Darwin-35B-A3B-Opus | 90.0% |
| Mother (Jackrong Claude 4.6 Opus Distilled) | 85.0% |
| Father (Qwen3.5-35B-A3B Official) | 84.2% |
### MMMLU (Multilingual Knowledge, 29 Languages)
| Model | Accuracy |
|---|---|
| Darwin-35B-A3B-Opus | 85.0% |
| Father (Qwen3.5-35B-A3B Official) | 85.2% |
## How Darwin Was Created
Darwin-35B-A3B-Opus was created using Darwin V5, a diagnostic-guided evolutionary merge engine built on mergekit.
Both parent models share the same Qwen3.5-35B-A3B architecture. The Mother is a LoRA SFT on the same base, not a different architecture.
Darwin V5 adds three phases over standard mergekit evolve:
- Pre-merge parent profiling (40 layers x 256 experts: activation frequency, routing entropy, probe cosine distance)
- Evolution with diagnostic-informed initial population and constrained search space
- Post-merge child validation (layer-by-layer comparison against both parents)
Key diagnostic finding: Mother had 50-65% dead experts (activation < 5%) from text-only LoRA SFT. Darwin compensated by reducing Mother density and using Father's living experts to fill inactive slots.
Merge configuration:

```text
# Method: DARE-TIES via mergekit
L0-L37: t=0.5988 (Mother 60%) → router from Mother
L38:    t=0.9000 (Mother 90%) → reasoning core (peak probe cosine distance)
L39:    t=0.5336 (Father 47%) → router from Father (output routing)
```
For full technical details, diagnostics, and health check results, see the original model card.
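In the configuration above, `t` is the Mother interpolation weight and the Father share is `1 - t`. A quick sanity check of the percentages listed:

```shell
# Compute the Mother/Father split for a given interpolation weight t
split_for_t() {
  awk -v t="$1" 'BEGIN { printf "Mother %.0f%% / Father %.0f%%\n", t*100, (1-t)*100 }'
}

split_for_t 0.5988   # layers 0-37
split_for_t 0.9000   # layer 38
split_for_t 0.5336   # layer 39 (Father-leaning: 47%)
```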
## Other Quantizations
| Quantization | Size | Quality | Use Case |
|---|---|---|---|
| Q8_0 (this) | ~37 GB | Near-lossless | Maximum quality |
| Q4_K_M (coming soon) | ~20 GB | Good | RTX 4090, Mac 32GB |
## Model Specifications

| Specification | Value |
|---|---|
| Base Model | FINAL-Bench/Darwin-35B-A3B-Opus |
| Architecture | Qwen3.5 MoE (Gated DeltaNet + MoE) |
| Total Parameters | 35B |
| Active Parameters | 3B per forward pass |
| Experts | 256 (8 routed + 1 shared active) |
| Context Length | 262,144 native |
| Languages | 201 |
| Quantization | Q8_0 (8-bit integer) |
| GGUF Shards | 3 files |
| License | Apache 2.0 |
| Quantized by | VIDRAFT via llama.cpp |
## Acknowledgements

- Korean Government: GPU Support Program research grant
- Qwen Team: Qwen3.5-35B-A3B base architecture
- Jackrong: Claude 4.6 Opus Reasoning Distilled model
- mergekit: merge backend infrastructure
- llama.cpp: GGUF conversion and quantization
## Citation

```bibtex
@misc{vidraft_darwin_35b_opus_gguf,
  title        = {Darwin-35B-A3B-Opus-Q8_0-GGUF},
  author       = {VIDRAFT},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/FINAL-Bench/Darwin-35B-A3B-Opus-Q8-GGUF}}
}
```