Note: These models are optimized for use within an agentic harness (e.g. Hermes Agent) and may behave unexpectedly in raw inference without a system prompt. Capability benchmarks are strong, but conversational behavior outside a structured harness is not reliable. I am currently working on v2 to address this and reduce the harness dependency.
## Support This Work
I'm a PhD student who also happens to spend way too much time fine-tuning, merging, and quantizing open-weight models on rented H100s and a local DGX Spark. It's a hobby that got out of hand.
If my uploads have been useful to you, consider buying a PhD student a coffee. It goes a long way toward keeping these experiments running.
# Hermes Qwen3.5 35B-A3B GGUF
GGUF quantizations of a Qwen3.5-35B-A3B model fine-tuned on NousResearch/hermes-function-calling-v1 for structured function calling and tool use.
## Base Model
- Architecture: Qwen3.5 MoE (Mixture of Experts) — 35B total parameters, ~3B active per token
- Base: Qwen/Qwen3.5-35B-A3B
- Context Length: 262,144 tokens
- Experts: 256 total, 8 active per token
## Fine-Tuning Details
- Method: LoRA via Unsloth + TRL SFTTrainer
- LoRA Rank (r): 32
- LoRA Alpha: 32
- LoRA Dropout: 0
- Target Modules: `q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`, `up_proj`, `down_proj`
- Training Precision: bf16
- Optimizer: AdamW 8-bit
- Learning Rate: 2e-4 with cosine scheduler
- Warmup Steps: 10
- Epochs: 3
- Batch Size: 2 per device, 8 gradient accumulation steps (effective batch size 16)
- Max Sequence Length: 4,096 tokens
- Weight Decay: 0.01
- PEFT Version: 0.18.1
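The hyperparameters above correspond roughly to the following Unsloth + TRL setup. This is a hedged reconstruction of the configuration, not the exact training script; the dataset loading call and `output_dir` are assumptions:

```python
# Hedged reconstruction of the training configuration described above;
# not the exact script used. Requires a GPU and the unsloth/trl/datasets packages.
from unsloth import FastLanguageModel
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="Qwen/Qwen3.5-35B-A3B",
    max_seq_length=4096,      # max sequence length from the details above
    load_in_4bit=False,       # trained in bf16, not QLoRA
)

model = FastLanguageModel.get_peft_model(
    model,
    r=32,
    lora_alpha=32,
    lora_dropout=0,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

# Exact split/config name is an assumption; the raw dataset still needs
# ChatML formatting (see Training Dataset below).
dataset = load_dataset("NousResearch/hermes-function-calling-v1", split="train")

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    args=SFTConfig(
        output_dir="outputs",
        per_device_train_batch_size=2,
        gradient_accumulation_steps=8,   # effective batch size 16
        num_train_epochs=3,
        learning_rate=2e-4,
        lr_scheduler_type="cosine",
        warmup_steps=10,
        optim="adamw_8bit",
        weight_decay=0.01,
        bf16=True,
    ),
)
```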
## Training Dataset
NousResearch/hermes-function-calling-v1 — a function-calling dataset following the Hermes Function-calling Standard. Includes:
- Cleaned Glaive Function Calling samples
- Advanced JSON structured output (agentic, multi-turn)
- Single-turn JSON structured output samples
Conversations were formatted using ChatML (`<|im_start|>` / `<|im_end|>`) with role mapping: `system` and `tool` kept as-is, `human` -> `user`, `gpt` -> `assistant`.
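The role mapping can be sketched as a small formatter. This is a minimal illustration of the ChatML layout, not the actual preprocessing code:

```python
# Minimal ChatML formatter illustrating the role mapping described above.
ROLE_MAP = {"system": "system", "human": "user", "gpt": "assistant", "tool": "tool"}

def to_chatml(conversation):
    """Render a list of {'from': ..., 'value': ...} turns as a ChatML string."""
    parts = []
    for turn in conversation:
        role = ROLE_MAP[turn["from"]]
        parts.append(f"<|im_start|>{role}\n{turn['value']}<|im_end|>")
    return "\n".join(parts)

example = [
    {"from": "system", "value": "You are a helpful assistant."},
    {"from": "human", "value": "What is the weather in Paris?"},
]
print(to_chatml(example))
```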
## Quantization
All quantizations were produced using llama.cpp with an importance matrix (imatrix) computed from WikiText-2 calibration data for improved quality at lower bit depths.
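The pipeline looks roughly like the following two llama.cpp commands; tool names are llama.cpp's, but the file names here are illustrative:

```shell
# 1. Compute an importance matrix from WikiText-2 calibration text
#    (calibration file path is illustrative):
llama-imatrix -m hermes-qwen3.5-35b-a3b-f16.gguf \
  -f wikitext-2-raw/wiki.train.raw -o imatrix.dat

# 2. Quantize with the imatrix for better quality at low bit depths:
llama-quantize --imatrix imatrix.dat \
  hermes-qwen3.5-35b-a3b-f16.gguf hermes-qwen3.5-35b-a3b-IQ4_XS.gguf IQ4_XS
```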
### Available Quants
| Filename | Quant | Type | Size |
|---|---|---|---|
| hermes-qwen3.5-35b-a3b-f16.gguf | F16 | Full precision | 64.6 GB |
| hermes-qwen3.5-35b-a3b-Q8_0.gguf | Q8_0 | Standard | 36.9 GB |
| hermes-qwen3.5-35b-a3b-Q6_K.gguf | Q6_K | K-quant | 28.5 GB |
| hermes-qwen3.5-35b-a3b-Q5_K_M.gguf | Q5_K_M | K-quant | 24.7 GB |
| hermes-qwen3.5-35b-a3b-Q5_K_S.gguf | Q5_K_S | K-quant | 24.0 GB |
| hermes-qwen3.5-35b-a3b-Q4_K_M.gguf | Q4_K_M | K-quant | 21.2 GB |
| hermes-qwen3.5-35b-a3b-Q4_K_S.gguf | Q4_K_S | K-quant | 19.9 GB |
| hermes-qwen3.5-35b-a3b-IQ4_NL.gguf | IQ4_NL | imatrix | 19.8 GB |
| hermes-qwen3.5-35b-a3b-IQ4_XS.gguf | IQ4_XS | imatrix | 18.7 GB |
| hermes-qwen3.5-35b-a3b-Q3_K_M.gguf | Q3_K_M | K-quant | 16.8 GB |
| hermes-qwen3.5-35b-a3b-IQ3_M.gguf | IQ3_M | imatrix | 15.4 GB |
| hermes-qwen3.5-35b-a3b-IQ3_S.gguf | IQ3_S | imatrix | 15.3 GB |
| hermes-qwen3.5-35b-a3b-Q3_K_S.gguf | Q3_K_S | K-quant | 15.2 GB |
| hermes-qwen3.5-35b-a3b-IQ3_XXS.gguf | IQ3_XXS | imatrix | 13.6 GB |
| hermes-qwen3.5-35b-a3b-IQ2_M.gguf | IQ2_M | imatrix | 11.7 GB |
| hermes-qwen3.5-35b-a3b-IQ2_S.gguf | IQ2_S | imatrix | 10.7 GB |
| hermes-qwen3.5-35b-a3b-IQ2_XXS.gguf | IQ2_XXS | imatrix | 9.5 GB |
| hermes-qwen3.5-35b-a3b-IQ1_M.gguf | IQ1_M | imatrix | 8.2 GB |
| hermes-qwen3.5-35b-a3b-IQ1_S.gguf | IQ1_S | imatrix | 7.5 GB |
All quantizations verified: 733 tensors, GGUF v3.
### Choosing a Quant
- Q8_0 (36.9 GB): Closest to full precision. Use if you have the VRAM/RAM.
- Q6_K / Q5_K_M (28.5 / 24.7 GB): Good balance of quality and size for most use cases.
- Q4_K_M (21.2 GB): Popular sweet spot — significant size reduction with minimal quality loss.
- IQ4_NL / IQ4_XS (19.8 / 18.7 GB): Importance-matrix 4-bit — can outperform standard Q4 quants at similar size.
- IQ3_M / IQ3_S (15.4 / 15.3 GB): Importance-matrix 3-bit — good quality for the size with imatrix calibration.
- IQ2_M and below (11.7 GB and smaller): Extreme compression with imatrix for constrained environments. Quality degrades progressively.
- IQ1_M / IQ1_S (8.2 / 7.5 GB): Maximum compression. Expect significant quality loss.
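As a rule of thumb, pick the largest quant that fits your memory budget with headroom for KV cache and runtime overhead. A toy helper (file sizes copied from the table above; the 2 GB headroom default is an assumption, not a measurement):

```python
# Toy helper: choose the largest quant whose file fits a memory budget (GB),
# leaving headroom for KV cache and runtime overhead.
QUANT_SIZES_GB = {
    "Q8_0": 36.9, "Q6_K": 28.5, "Q5_K_M": 24.7, "Q5_K_S": 24.0,
    "Q4_K_M": 21.2, "Q4_K_S": 19.9, "IQ4_NL": 19.8, "IQ4_XS": 18.7,
    "Q3_K_M": 16.8, "IQ3_M": 15.4, "IQ3_S": 15.3, "Q3_K_S": 15.2,
    "IQ3_XXS": 13.6, "IQ2_M": 11.7, "IQ2_S": 10.7, "IQ2_XXS": 9.5,
    "IQ1_M": 8.2, "IQ1_S": 7.5,
}

def pick_quant(budget_gb, headroom_gb=2.0):
    """Return the largest quant that fits within budget minus headroom, or None."""
    fitting = {q: s for q, s in QUANT_SIZES_GB.items() if s + headroom_gb <= budget_gb}
    return max(fitting, key=fitting.get) if fitting else None

print(pick_quant(24))  # a 24 GB GPU -> 'Q4_K_M'
```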
## Usage
### llama.cpp

```shell
llama-cli -m hermes-qwen3.5-35b-a3b-Q4_K_M.gguf -p "You are a helpful assistant." -cnv
```
### LM Studio / Ollama / KoboldCpp
Download any GGUF file and load it directly.
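Since this fine-tune targets the Hermes function-calling format, tool invocations are emitted as JSON inside `<tool_call>` tags. A minimal parser sketch (the tag format follows the Hermes standard; the helper itself is illustrative):

```python
import json
import re

# Extract JSON tool invocations from Hermes-style <tool_call>...</tool_call> tags.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)

def parse_tool_calls(text):
    """Return a list of {'name': ..., 'arguments': ...} dicts found in model output."""
    return [json.loads(m) for m in TOOL_CALL_RE.findall(text)]

output = (
    "Let me check that.\n"
    '<tool_call>\n{"name": "get_weather", "arguments": {"city": "Paris"}}\n</tool_call>'
)
print(parse_tool_calls(output))  # [{'name': 'get_weather', 'arguments': {'city': 'Paris'}}]
```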
## Credits
- Base Model: Qwen Team
- Training Dataset: NousResearch
- Fine-Tuning Framework: Unsloth
- Quantization Tooling: llama.cpp