Note: These models are optimized for use within an agentic harness (e.g. Hermes Agent) and may behave unexpectedly in raw inference without a system prompt. Capability benchmarks are strong, but conversational behavior outside a structured harness is not reliable. I am currently working on v2 to address this and reduce harness dependency.

Support This Work

I'm a PhD student who also happens to spend way too much time fine-tuning, merging, and quantizing open-weight models on rented H100s and a local DGX Spark. It's a hobby that got out of hand.

If my uploads have been useful to you, consider buying a PhD student a coffee. It goes a long way toward keeping these experiments running.

☕ ko-fi.com/djlougen


Hermes Qwen3.5 35B-A3B GGUF

GGUF quantizations of a Qwen3.5-35B-A3B model fine-tuned on NousResearch/hermes-function-calling-v1 for structured function calling and tool use.

Base Model

  • Architecture: Qwen3.5 MoE (Mixture of Experts) — 35B total parameters, ~3B active per token
  • Base: Qwen/Qwen3.5-35B-A3B
  • Context Length: 262,144 tokens
  • Experts: 256 total, 8 active per token

Fine-Tuning Details

  • Method: LoRA via Unsloth + TRL SFTTrainer
  • LoRA Rank (r): 32
  • LoRA Alpha: 32
  • LoRA Dropout: 0
  • Target Modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
  • Training Precision: bf16
  • Optimizer: AdamW 8-bit
  • Learning Rate: 2e-4 with cosine scheduler
  • Warmup Steps: 10
  • Epochs: 3
  • Batch Size: 2 per device, 8 gradient accumulation steps (effective batch size 16)
  • Max Sequence Length: 4,096 tokens
  • Weight Decay: 0.01
  • PEFT Version: 0.18.1
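The batch-size arithmetic above can be checked directly. This is a minimal sketch; the single-device assumption is mine (the hardware used for this run is not stated above):

```python
# Reproduce the effective batch size from the hyperparameters listed above.
per_device_batch = 2
grad_accum_steps = 8
num_devices = 1  # assumption: single-device run; not stated in the list above

effective_batch = per_device_batch * grad_accum_steps * num_devices
print(effective_batch)  # 16, matching the listed effective batch size

# Upper bound on tokens per optimizer step at the 4,096-token sequence cap
max_tokens_per_step = effective_batch * 4096
print(max_tokens_per_step)  # 65536
```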

Training Dataset

NousResearch/hermes-function-calling-v1 — a function-calling dataset that follows the Hermes function-calling standard. It includes:

  • Cleaned Glaive Function Calling samples
  • Advanced JSON structured output (agentic, multi-turn)
  • Single-turn JSON structured output samples

Conversations were formatted using ChatML (<|im_start|> / <|im_end|>) with roles mapped as: system -> system, human -> user, gpt -> assistant, tool -> tool.
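The formatting step above can be sketched as a small helper. The turn structure (`{"from": ..., "value": ...}`) assumes the dataset's ShareGPT-style layout:

```python
# Sketch of the ChatML formatting described above: each turn is wrapped in
# <|im_start|>{role} ... <|im_end|>, with ShareGPT role names remapped.
ROLE_MAP = {"system": "system", "human": "user", "gpt": "assistant", "tool": "tool"}

def to_chatml(turns):
    """Render a list of {"from": ..., "value": ...} turns as a ChatML string."""
    rendered = []
    for turn in turns:
        role = ROLE_MAP[turn["from"]]
        rendered.append(f"<|im_start|>{role}\n{turn['value']}<|im_end|>")
    return "\n".join(rendered)

print(to_chatml([{"from": "human", "value": "What is 2+2?"}]))
# <|im_start|>user
# What is 2+2?<|im_end|>
```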

Quantization

All quantizations were produced using llama.cpp with an importance matrix (imatrix) computed from WikiText-2 calibration data for improved quality at lower bit depths.
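For reference, the imatrix pipeline looks roughly like the commands below. This is a sketch, not the exact invocation used here; the calibration-file and output filenames are placeholders of mine:

```shell
# Compute an importance matrix from calibration text (filenames are illustrative)
llama-imatrix -m hermes-qwen3.5-35b-a3b-f16.gguf \
  -f wikitext-2-raw/wiki.train.raw -o imatrix.dat

# Quantize using the imatrix to guide low-bit rounding
llama-quantize --imatrix imatrix.dat \
  hermes-qwen3.5-35b-a3b-f16.gguf hermes-qwen3.5-35b-a3b-IQ4_XS.gguf IQ4_XS
```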

Available Quants

| Filename | Quant | Type | Size |
|---|---|---|---|
| hermes-qwen3.5-35b-a3b-f16.gguf | F16 | Full precision | 64.6 GB |
| hermes-qwen3.5-35b-a3b-Q8_0.gguf | Q8_0 | Standard | 36.9 GB |
| hermes-qwen3.5-35b-a3b-Q6_K.gguf | Q6_K | K-quant | 28.5 GB |
| hermes-qwen3.5-35b-a3b-Q5_K_M.gguf | Q5_K_M | K-quant | 24.7 GB |
| hermes-qwen3.5-35b-a3b-Q5_K_S.gguf | Q5_K_S | K-quant | 24.0 GB |
| hermes-qwen3.5-35b-a3b-Q4_K_M.gguf | Q4_K_M | K-quant | 21.2 GB |
| hermes-qwen3.5-35b-a3b-Q4_K_S.gguf | Q4_K_S | K-quant | 19.9 GB |
| hermes-qwen3.5-35b-a3b-IQ4_NL.gguf | IQ4_NL | imatrix | 19.8 GB |
| hermes-qwen3.5-35b-a3b-IQ4_XS.gguf | IQ4_XS | imatrix | 18.7 GB |
| hermes-qwen3.5-35b-a3b-Q3_K_M.gguf | Q3_K_M | K-quant | 16.8 GB |
| hermes-qwen3.5-35b-a3b-IQ3_M.gguf | IQ3_M | imatrix | 15.4 GB |
| hermes-qwen3.5-35b-a3b-IQ3_S.gguf | IQ3_S | imatrix | 15.3 GB |
| hermes-qwen3.5-35b-a3b-Q3_K_S.gguf | Q3_K_S | K-quant | 15.2 GB |
| hermes-qwen3.5-35b-a3b-IQ3_XXS.gguf | IQ3_XXS | imatrix | 13.6 GB |
| hermes-qwen3.5-35b-a3b-IQ2_M.gguf | IQ2_M | imatrix | 11.7 GB |
| hermes-qwen3.5-35b-a3b-IQ2_S.gguf | IQ2_S | imatrix | 10.7 GB |
| hermes-qwen3.5-35b-a3b-IQ2_XXS.gguf | IQ2_XXS | imatrix | 9.5 GB |
| hermes-qwen3.5-35b-a3b-IQ1_M.gguf | IQ1_M | imatrix | 8.2 GB |
| hermes-qwen3.5-35b-a3b-IQ1_S.gguf | IQ1_S | imatrix | 7.5 GB |

All quantizations verified: 733 tensors, GGUF v3.

Choosing a Quant

  • Q8_0 (36.9 GB): Closest to full precision. Use if you have the VRAM/RAM.
  • Q6_K / Q5_K_M (28.5 / 24.7 GB): Good balance of quality and size for most use cases.
  • Q4_K_M (21.2 GB): Popular sweet spot — significant size reduction with minimal quality loss.
  • IQ4_NL / IQ4_XS (19.8 / 18.7 GB): Importance-matrix 4-bit — can outperform standard Q4 quants at similar size.
  • IQ3_M / IQ3_S (15.4 / 15.3 GB): Importance-matrix 3-bit — good quality for the size with imatrix calibration.
  • IQ2_M and below (11.7 GB and smaller): Extreme compression with imatrix. Quality degrades progressively.
  • IQ1_M / IQ1_S (8.2 / 7.5 GB): Maximum compression. Expect significant quality loss.
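The selection logic above can be sketched as a hypothetical helper. File sizes come from the table above; the 2 GB headroom figure for KV cache and runtime overhead is my assumption, not a measured value:

```python
# Hypothetical helper: pick the largest quant whose file fits a memory budget,
# leaving headroom for KV cache and runtime overhead.
QUANTS = [  # (name, file size in GB), sorted largest-first, from the table above
    ("Q8_0", 36.9), ("Q6_K", 28.5), ("Q5_K_M", 24.7), ("Q4_K_M", 21.2),
    ("IQ4_XS", 18.7), ("IQ3_M", 15.4), ("IQ2_M", 11.7), ("IQ1_S", 7.5),
]

def pick_quant(budget_gb, headroom_gb=2.0):
    """Return the largest listed quant that fits in budget_gb minus headroom."""
    for name, size_gb in QUANTS:
        if size_gb + headroom_gb <= budget_gb:
            return name
    return None  # nothing fits

print(pick_quant(24))  # Q4_K_M (21.2 GB file + 2 GB headroom fits in 24 GB)
```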

Usage

llama.cpp

```shell
llama-cli -m hermes-qwen3.5-35b-a3b-Q4_K_M.gguf -p "You are a helpful assistant." -cnv
```

LM Studio / Ollama / KoboldCpp

Download any GGUF file and load it directly.
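Handling Tool Calls

Since the fine-tune targets the Hermes function-calling format, downstream code typically needs to parse tool calls out of the model's replies. A minimal sketch, assuming the standard Hermes convention of a JSON object wrapped in <tool_call> tags (the example reply and tool name are illustrative):

```python
import json
import re

# Match a JSON object wrapped in <tool_call>...</tool_call> tags, as emitted
# by Hermes-style function-calling models.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)

def extract_tool_calls(text):
    """Return the parsed JSON payload of every <tool_call> block in text."""
    return [json.loads(payload) for payload in TOOL_CALL_RE.findall(text)]

reply = '<tool_call>\n{"name": "get_weather", "arguments": {"city": "Paris"}}\n</tool_call>'
calls = extract_tool_calls(reply)
print(calls[0]["name"])  # get_weather
```

In practice the parsed `name` and `arguments` are dispatched to your own tool implementations, and the result is fed back to the model in a tool-role turn.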
