---
base_model: Qwen/Qwen3.5-35B-A3B
tags:
- qwen3.5
- moe
- gguf
- lora
- function-calling
- hermes
- unsloth
license: apache-2.0
datasets:
- NousResearch/hermes-function-calling-v1
pipeline_tag: text-generation
model-index:
- name: hermes-qwen3.5-35b-a3b-GGUF
results: []
---
> **Note:** These models are optimized for use within an agentic harness (e.g., Hermes Agent) and may behave unexpectedly in raw inference without a system prompt. Capability benchmarks are strong, but conversational behavior outside a structured harness is not reliable. I am currently working on v2 to address this and reduce harness dependency.

## Support This Work

I'm a PhD student who also happens to spend way too much time fine-tuning, merging, and quantizing open-weight models on rented H100s and a local DGX Spark. It's a hobby that got out of hand.

If my uploads have been useful to you, consider buying a PhD student a coffee. It goes a long way toward keeping these experiments running.

☕ [ko-fi.com/djlougen](https://ko-fi.com/djlougen)
---
# Hermes Qwen3.5 35B-A3B GGUF
GGUF quantizations of a Qwen3.5-35B-A3B model fine-tuned on [NousResearch/hermes-function-calling-v1](https://huggingface.co/datasets/NousResearch/hermes-function-calling-v1) for structured function calling and tool use.
## Base Model
- **Architecture:** Qwen3.5 MoE (Mixture of Experts) — 35B total parameters, ~3B active per token
- **Base:** [Qwen/Qwen3.5-35B-A3B](https://huggingface.co/Qwen/Qwen3.5-35B-A3B)
- **Context Length:** 262,144 tokens
- **Experts:** 256 total, 8 active per token
## Fine-Tuning Details
- **Method:** LoRA via [Unsloth](https://github.com/unslothai/unsloth) + TRL SFTTrainer
- **LoRA Rank (r):** 32
- **LoRA Alpha:** 32
- **LoRA Dropout:** 0
- **Target Modules:** `q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`, `up_proj`, `down_proj`
- **Training Precision:** bf16
- **Optimizer:** AdamW 8-bit
- **Learning Rate:** 2e-4 with cosine scheduler
- **Warmup Steps:** 10
- **Epochs:** 3
- **Batch Size:** 2 per device, 8 gradient accumulation steps (effective batch size 16)
- **Max Sequence Length:** 4,096 tokens
- **Weight Decay:** 0.01
- **PEFT Version:** 0.18.1
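To get an intuition for what rank 32 on those target modules means, here is a rough, back-of-the-envelope estimate of LoRA trainable parameters. The layer dimensions used below are illustrative placeholders, not the real Qwen3.5-35B-A3B shapes:

```python
# LoRA replaces the update to each adapted weight W (d_out x d_in) with two
# low-rank factors B (d_out x r) and A (r x d_in), so each adapted module
# contributes r * (d_in + d_out) trainable parameters.

def lora_params(shapes, r=32):
    """shapes: list of (d_out, d_in) for every adapted projection."""
    return sum(r * (d_in + d_out) for d_out, d_in in shapes)

# Example: the four attention projections of one block, with a
# hypothetical hidden size of 2048.
hidden = 2048
attn = [(hidden, hidden)] * 4  # q_proj, k_proj, v_proj, o_proj
print(lora_params(attn, r=32))  # 32 * (2048 + 2048) * 4 = 524288
```

Summing this across all blocks (and the MoE `gate_proj`/`up_proj`/`down_proj` modules) gives the adapter size, which is typically well under 1% of the 35B total.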
### Training Dataset
[NousResearch/hermes-function-calling-v1](https://huggingface.co/datasets/NousResearch/hermes-function-calling-v1) — a function-calling dataset following the Hermes Function-calling Standard. Includes:
- Cleaned Glaive Function Calling samples
- Advanced JSON structured output (agentic, multi-turn)
- Single-turn JSON structured output samples
Conversations were formatted using ChatML (`<|im_start|>` / `<|im_end|>`), with dataset roles mapped as `system` -> `system`, `human` -> `user`, `gpt` -> `assistant`, `tool` -> `tool`.
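The formatting above can be sketched as a small helper; this is a minimal illustration of the role mapping and ChatML delimiters, not the exact preprocessing script used for training:

```python
# Map Hermes dataset roles to ChatML roles as described above.
ROLE_MAP = {"system": "system", "human": "user", "gpt": "assistant", "tool": "tool"}

def to_chatml(turns):
    """turns: list of (dataset_role, text) pairs in conversation order."""
    return "\n".join(
        f"<|im_start|>{ROLE_MAP[role]}\n{text}<|im_end|>" for role, text in turns
    )

prompt = to_chatml([
    ("system", "You are a function-calling assistant."),
    ("human", "What is the weather in Paris?"),
])
print(prompt)
```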
## Quantization
All quantizations were produced using [llama.cpp](https://github.com/ggerganov/llama.cpp) with an **importance matrix** (imatrix) computed from WikiText-2 calibration data for improved quality at lower bit depths.
### Available Quants
| Filename | Quant | Type | Size |
|----------|-------|------|------|
| hermes-qwen3.5-35b-a3b-f16.gguf | F16 | Unquantized (16-bit) | 64.6 GB |
| hermes-qwen3.5-35b-a3b-Q8_0.gguf | Q8_0 | Standard | 36.9 GB |
| hermes-qwen3.5-35b-a3b-Q6_K.gguf | Q6_K | K-quant | 28.5 GB |
| hermes-qwen3.5-35b-a3b-Q5_K_M.gguf | Q5_K_M | K-quant | 24.7 GB |
| hermes-qwen3.5-35b-a3b-Q5_K_S.gguf | Q5_K_S | K-quant | 24.0 GB |
| hermes-qwen3.5-35b-a3b-Q4_K_M.gguf | Q4_K_M | K-quant | 21.2 GB |
| hermes-qwen3.5-35b-a3b-Q4_K_S.gguf | Q4_K_S | K-quant | 19.9 GB |
| hermes-qwen3.5-35b-a3b-IQ4_NL.gguf | IQ4_NL | imatrix | 19.8 GB |
| hermes-qwen3.5-35b-a3b-IQ4_XS.gguf | IQ4_XS | imatrix | 18.7 GB |
| hermes-qwen3.5-35b-a3b-Q3_K_M.gguf | Q3_K_M | K-quant | 16.8 GB |
| hermes-qwen3.5-35b-a3b-IQ3_M.gguf | IQ3_M | imatrix | 15.4 GB |
| hermes-qwen3.5-35b-a3b-IQ3_S.gguf | IQ3_S | imatrix | 15.3 GB |
| hermes-qwen3.5-35b-a3b-Q3_K_S.gguf | Q3_K_S | K-quant | 15.2 GB |
| hermes-qwen3.5-35b-a3b-IQ3_XXS.gguf | IQ3_XXS | imatrix | 13.6 GB |
| hermes-qwen3.5-35b-a3b-IQ2_M.gguf | IQ2_M | imatrix | 11.7 GB |
| hermes-qwen3.5-35b-a3b-IQ2_S.gguf | IQ2_S | imatrix | 10.7 GB |
| hermes-qwen3.5-35b-a3b-IQ2_XXS.gguf | IQ2_XXS | imatrix | 9.5 GB |
| hermes-qwen3.5-35b-a3b-IQ1_M.gguf | IQ1_M | imatrix | 8.2 GB |
| hermes-qwen3.5-35b-a3b-IQ1_S.gguf | IQ1_S | imatrix | 7.5 GB |
All quantizations verified: 733 tensors, GGUF v3.
### Choosing a Quant
- **Q8_0** (36.9 GB): Closest to the unquantized model. Use it if you have the VRAM/RAM.
- **Q6_K / Q5_K_M** (28.5 / 24.7 GB): Good balance of quality and size for most use cases.
- **Q4_K_M** (21.2 GB): Popular sweet spot — significant size reduction with minimal quality loss.
- **IQ4_NL / IQ4_XS** (19.8 / 18.7 GB): Importance-matrix 4-bit — can outperform standard Q4 quants at similar size.
- **IQ3_M / IQ3_S** (15.4 / 15.3 GB): Importance-matrix 3-bit — good quality for the size with imatrix calibration.
- **IQ2_M and below** (11.7 GB and smaller): Extreme compression with imatrix. Quality degrades progressively.
- **IQ1_M / IQ1_S** (8.2 / 7.5 GB): Maximum compression. Expect significant quality loss.
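As a rule of thumb, pick the largest quant whose file size fits your memory budget with headroom left for the KV cache and runtime overhead. The helper below is a hypothetical illustration using the sizes from the table above (the 2 GB headroom default is a rough guess; long contexts need considerably more):

```python
# (name, file size in GB), largest first, taken from the quant table above.
QUANTS = [
    ("Q8_0", 36.9), ("Q6_K", 28.5), ("Q5_K_M", 24.7), ("Q4_K_M", 21.2),
    ("IQ4_XS", 18.7), ("IQ3_M", 15.4), ("IQ2_M", 11.7), ("IQ1_S", 7.5),
]

def pick_quant(mem_gb, headroom_gb=2.0):
    """Return the largest quant that fits mem_gb, leaving headroom for KV cache."""
    for name, size in QUANTS:
        if size + headroom_gb <= mem_gb:
            return name
    return None  # nothing fits fully in memory; consider partial CPU offload

print(pick_quant(24))  # -> Q4_K_M on a 24 GB GPU
```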
## Usage
### llama.cpp

Interactive chat (in conversation mode, `-p` supplies the system prompt):

```bash
llama-cli -m hermes-qwen3.5-35b-a3b-Q4_K_M.gguf -p "You are a helpful assistant." -cnv
```
### LM Studio / Ollama / KoboldCpp
Download any GGUF file and load it directly.
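For Ollama specifically, one way to register a downloaded GGUF is via a Modelfile. This is a minimal sketch; the filename and context size are illustrative:

```
FROM ./hermes-qwen3.5-35b-a3b-Q4_K_M.gguf
PARAMETER num_ctx 8192
```

Then create and run it with `ollama create hermes-qwen3.5 -f Modelfile` followed by `ollama run hermes-qwen3.5`.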
## Credits
- **Base Model:** [Qwen Team](https://huggingface.co/Qwen)
- **Training Dataset:** [NousResearch](https://huggingface.co/NousResearch)
- **Fine-Tuning Framework:** [Unsloth](https://github.com/unslothai/unsloth)
- **Quantization Tooling:** [llama.cpp](https://github.com/ggerganov/llama.cpp)