Note: These models are optimized for use within an agentic harness (e.g. Hermes Agent) and may behave unexpectedly in raw inference without a system prompt. Capability benchmarks are strong, but conversational behavior outside a structured harness is not reliable. I am currently working on v2 to address this and reduce the harness dependency.
## Support This Work
I'm a PhD student who also happens to spend way too much time fine-tuning, merging, and quantizing open-weight models on rented H100s and a local DGX Spark. It's a hobby that got out of hand.
If my uploads have been useful to you, consider buying a PhD student a coffee. It goes a long way toward keeping these experiments running.
# Hermes Qwen3.5 35B-A3B GGUF
GGUF quantizations of a Qwen3.5-35B-A3B model fine-tuned on NousResearch/hermes-function-calling-v1 for structured function calling and tool use.
## Base Model
- Architecture: Qwen3.5 MoE (Mixture of Experts) — 35B total parameters, ~3B active per token
- Base: Qwen/Qwen3.5-35B-A3B
- Context Length: 262,144 tokens
- Experts: 256 total, 8 active per token
## Fine-Tuning Details
- Method: LoRA via Unsloth + TRL SFTTrainer
- LoRA Rank (r): 32
- LoRA Alpha: 32
- LoRA Dropout: 0
- Target Modules: `q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`, `up_proj`, `down_proj`
- Training Precision: bf16
- Optimizer: AdamW 8-bit
- Learning Rate: 2e-4 with cosine scheduler
- Warmup Steps: 10
- Epochs: 3
- Batch Size: 2 per device, 8 gradient accumulation steps (effective batch size 16)
- Max Sequence Length: 4,096 tokens
- Weight Decay: 0.01
- PEFT Version: 0.18.1
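The hyperparameters above correspond roughly to the following Unsloth + TRL setup. This is a hedged reconstruction of the configuration, not the exact training script; the dataset loading call and `output_dir` are assumptions:

```python
# Hedged reconstruction of the training configuration described above;
# not the exact script used. Requires a GPU and the unsloth/trl/datasets packages.
from unsloth import FastLanguageModel
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="Qwen/Qwen3.5-35B-A3B",
    max_seq_length=4096,      # max sequence length from the details above
    load_in_4bit=False,       # trained in bf16, not QLoRA
)

model = FastLanguageModel.get_peft_model(
    model,
    r=32,
    lora_alpha=32,
    lora_dropout=0,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

# Exact split/config name is an assumption; the raw dataset still needs
# ChatML formatting (see Training Dataset below).
dataset = load_dataset("NousResearch/hermes-function-calling-v1", split="train")

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    args=SFTConfig(
        output_dir="outputs",
        per_device_train_batch_size=2,
        gradient_accumulation_steps=8,   # effective batch size 16
        num_train_epochs=3,
        learning_rate=2e-4,
        lr_scheduler_type="cosine",
        warmup_steps=10,
        optim="adamw_8bit",
        weight_decay=0.01,
        bf16=True,
    ),
)
```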
## Training Dataset
NousResearch/hermes-function-calling-v1 — a function-calling dataset following the Hermes Function-calling Standard. Includes:
- Cleaned Glaive Function Calling samples
- Advanced JSON structured output (agentic, multi-turn)
- Single-turn JSON structured output samples
Conversations were formatted using ChatML (`<|im_start|>` / `<|im_end|>`) with role mapping: `system` and `tool` kept as-is, `human` -> `user`, `gpt` -> `assistant`.
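The role mapping can be sketched as a small formatter. This is a minimal illustration of the ChatML layout, not the actual preprocessing code:

```python
# Minimal ChatML formatter illustrating the role mapping described above.
ROLE_MAP = {"system": "system", "human": "user", "gpt": "assistant", "tool": "tool"}

def to_chatml(conversation):
    """Render a list of {'from': ..., 'value': ...} turns as a ChatML string."""
    parts = []
    for turn in conversation:
        role = ROLE_MAP[turn["from"]]
        parts.append(f"<|im_start|>{role}\n{turn['value']}<|im_end|>")
    return "\n".join(parts)

example = [
    {"from": "system", "value": "You are a helpful assistant."},
    {"from": "human", "value": "What is the weather in Paris?"},
]
print(to_chatml(example))
```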
## Quantization
All quantizations were produced using llama.cpp with an importance matrix (imatrix) computed from WikiText-2 calibration data for improved quality at lower bit depths.
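The pipeline looks roughly like the following two llama.cpp commands; tool names are llama.cpp's, but the file names here are illustrative:

```shell
# 1. Compute an importance matrix from WikiText-2 calibration text
#    (calibration file path is illustrative):
llama-imatrix -m hermes-qwen3.5-35b-a3b-f16.gguf \
  -f wikitext-2-raw/wiki.train.raw -o imatrix.dat

# 2. Quantize with the imatrix for better quality at low bit depths:
llama-quantize --imatrix imatrix.dat \
  hermes-qwen3.5-35b-a3b-f16.gguf hermes-qwen3.5-35b-a3b-IQ4_XS.gguf IQ4_XS
```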
### Available Quants
| Filename | Quant | Type | Size |
|---|---|---|---|
| hermes-qwen3.5-35b-a3b-f16.gguf | F16 | Full precision | 64.6 GB |
| hermes-qwen3.5-35b-a3b-Q8_0.gguf | Q8_0 | Standard | 36.9 GB |
| hermes-qwen3.5-35b-a3b-Q6_K.gguf | Q6_K | K-quant | 28.5 GB |
| hermes-qwen3.5-35b-a3b-Q5_K_M.gguf | Q5_K_M | K-quant | 24.7 GB |
| hermes-qwen3.5-35b-a3b-Q5_K_S.gguf | Q5_K_S | K-quant | 24.0 GB |
| hermes-qwen3.5-35b-a3b-Q4_K_M.gguf | Q4_K_M | K-quant | 21.2 GB |
| hermes-qwen3.5-35b-a3b-Q4_K_S.gguf | Q4_K_S | K-quant | 19.9 GB |
| hermes-qwen3.5-35b-a3b-IQ4_NL.gguf | IQ4_NL | imatrix | 19.8 GB |
| hermes-qwen3.5-35b-a3b-IQ4_XS.gguf | IQ4_XS | imatrix | 18.7 GB |
| hermes-qwen3.5-35b-a3b-Q3_K_M.gguf | Q3_K_M | K-quant | 16.8 GB |
| hermes-qwen3.5-35b-a3b-IQ3_M.gguf | IQ3_M | imatrix | 15.4 GB |
| hermes-qwen3.5-35b-a3b-IQ3_S.gguf | IQ3_S | imatrix | 15.3 GB |
| hermes-qwen3.5-35b-a3b-Q3_K_S.gguf | Q3_K_S | K-quant | 15.2 GB |
| hermes-qwen3.5-35b-a3b-IQ3_XXS.gguf | IQ3_XXS | imatrix | 13.6 GB |
| hermes-qwen3.5-35b-a3b-IQ2_M.gguf | IQ2_M | imatrix | 11.7 GB |
| hermes-qwen3.5-35b-a3b-IQ2_S.gguf | IQ2_S | imatrix | 10.7 GB |
| hermes-qwen3.5-35b-a3b-IQ2_XXS.gguf | IQ2_XXS | imatrix | 9.5 GB |
| hermes-qwen3.5-35b-a3b-IQ1_M.gguf | IQ1_M | imatrix | 8.2 GB |
| hermes-qwen3.5-35b-a3b-IQ1_S.gguf | IQ1_S | imatrix | 7.5 GB |
All quantizations verified: 733 tensors, GGUF v3.
### Choosing a Quant
- Q8_0 (36.9 GB): Closest to full precision. Use if you have the VRAM/RAM.
- Q6_K / Q5_K_M (28.5 / 24.7 GB): Good balance of quality and size for most use cases.
- Q4_K_M (21.2 GB): Popular sweet spot — significant size reduction with minimal quality loss.
- IQ4_NL / IQ4_XS (19.8 / 18.7 GB): Importance-matrix 4-bit — can outperform standard Q4 quants at similar size.
- IQ3_M / IQ3_S (15.4 / 15.3 GB): Importance-matrix 3-bit — good quality for the size with imatrix calibration.
- IQ2_M and below (11.7 GB and smaller): Extreme compression with imatrix for constrained environments. Quality degrades progressively.
- IQ1_M / IQ1_S (8.2 / 7.5 GB): Maximum compression. Expect significant quality loss.
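As a rule of thumb, pick the largest quant that fits your memory budget with headroom for KV cache and runtime overhead. A toy helper (file sizes copied from the table above; the 2 GB headroom default is an assumption, not a measurement):

```python
# Toy helper: choose the largest quant whose file fits a memory budget (GB),
# leaving headroom for KV cache and runtime overhead.
QUANT_SIZES_GB = {
    "Q8_0": 36.9, "Q6_K": 28.5, "Q5_K_M": 24.7, "Q5_K_S": 24.0,
    "Q4_K_M": 21.2, "Q4_K_S": 19.9, "IQ4_NL": 19.8, "IQ4_XS": 18.7,
    "Q3_K_M": 16.8, "IQ3_M": 15.4, "IQ3_S": 15.3, "Q3_K_S": 15.2,
    "IQ3_XXS": 13.6, "IQ2_M": 11.7, "IQ2_S": 10.7, "IQ2_XXS": 9.5,
    "IQ1_M": 8.2, "IQ1_S": 7.5,
}

def pick_quant(budget_gb, headroom_gb=2.0):
    """Return the largest quant that fits within budget minus headroom, or None."""
    fitting = {q: s for q, s in QUANT_SIZES_GB.items() if s + headroom_gb <= budget_gb}
    return max(fitting, key=fitting.get) if fitting else None

print(pick_quant(24))  # a 24 GB GPU -> 'Q4_K_M'
```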
## Usage
### llama.cpp

```shell
llama-cli -m hermes-qwen3.5-35b-a3b-Q4_K_M.gguf -p "You are a helpful assistant." -cnv
```
### LM Studio / Ollama / KoboldCpp
Download any GGUF file and load it directly.
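Since this fine-tune targets the Hermes function-calling format, tool invocations are emitted as JSON inside `<tool_call>` tags. A minimal parser sketch (the tag format follows the Hermes standard; the helper itself is illustrative):

```python
import json
import re

# Extract JSON tool invocations from Hermes-style <tool_call>...</tool_call> tags.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)

def parse_tool_calls(text):
    """Return a list of {'name': ..., 'arguments': ...} dicts found in model output."""
    return [json.loads(m) for m in TOOL_CALL_RE.findall(text)]

output = (
    "Let me check that.\n"
    '<tool_call>\n{"name": "get_weather", "arguments": {"city": "Paris"}}\n</tool_call>'
)
print(parse_tool_calls(output))  # [{'name': 'get_weather', 'arguments': {'city': 'Paris'}}]
```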
## Credits
- Base Model: Qwen Team
- Training Dataset: NousResearch
- Fine-Tuning Framework: Unsloth
- Quantization Tooling: llama.cpp