---
base_model: Qwen/Qwen3.5-35B-A3B
tags:
- qwen3.5
- moe
- gguf
- lora
- function-calling
- hermes
- unsloth
license: apache-2.0
datasets:
- NousResearch/hermes-function-calling-v1
pipeline_tag: text-generation
model-index:
- name: hermes-qwen3.5-35b-a3b-GGUF
  results: []
---

**Note:** These models are optimized for use within an agentic harness (e.g. Hermes Agent) and may behave unexpectedly in raw inference without a system prompt. Capability benchmarks are strong, but conversational behavior outside of a structured harness is not reliable. I am currently working on v2 to address this and reduce harness dependency.

## Support This Work

I'm a PhD student who also happens to spend way too much time fine-tuning, merging, and quantizing open-weight models on rented H100s and a local DGX Spark. It's a hobby that got out of hand.

If my uploads have been useful to you, consider buying a PhD student a coffee. It goes a long way toward keeping these experiments running.

☕ ko-fi.com/djlougen

---

# Hermes Qwen3.5 35B-A3B GGUF

GGUF quantizations of a Qwen3.5-35B-A3B model fine-tuned on [NousResearch/hermes-function-calling-v1](https://huggingface.co/datasets/NousResearch/hermes-function-calling-v1) for structured function calling and tool use.

## Base Model

- **Architecture:** Qwen3.5 MoE (Mixture of Experts) — 35B total parameters, ~3B active per token
- **Base:** [Qwen/Qwen3.5-35B-A3B](https://huggingface.co/Qwen/Qwen3.5-35B-A3B)
- **Context Length:** 262,144 tokens
- **Experts:** 256 total, 8 active per token

## Fine-Tuning Details

- **Method:** LoRA via [Unsloth](https://github.com/unslothai/unsloth) + TRL SFTTrainer
- **LoRA Rank (r):** 32
- **LoRA Alpha:** 32
- **LoRA Dropout:** 0
- **Target Modules:** `q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`, `up_proj`, `down_proj`
- **Training Precision:** bf16
- **Optimizer:** AdamW 8-bit
- **Learning Rate:** 2e-4 with cosine scheduler
- **Warmup Steps:** 10
- **Epochs:** 3
- **Batch Size:** 2 per device, 8 gradient accumulation steps (effective batch size 16)
- **Max Sequence Length:** 4,096 tokens
- **Weight Decay:** 0.01
- **PEFT Version:** 0.18.1

### Training Dataset

[NousResearch/hermes-function-calling-v1](https://huggingface.co/datasets/NousResearch/hermes-function-calling-v1) — a function-calling dataset following the Hermes Function-calling Standard. It includes:

- Cleaned Glaive Function Calling samples
- Advanced JSON structured output (agentic, multi-turn)
- Single-turn JSON structured output samples

Conversations were formatted using ChatML (`<|im_start|>` / `<|im_end|>`) with role mapping: `system`, `human` -> `user`, `gpt` -> `assistant`, `tool`.

## Quantization

All quantizations were produced using [llama.cpp](https://github.com/ggerganov/llama.cpp) with an **importance matrix** (imatrix) computed from WikiText-2 calibration data for improved quality at lower bit depths.
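To make the Hermes-style ChatML formatting from the Training Dataset section concrete, here is a minimal Python sketch. The `get_weather` tool schema and the conversation are invented for illustration (not from the training data); the `<tools>` / `<tool_call>` tags follow the Hermes Function-calling Standard.

```python
import json

# Illustrative tool schema -- not taken from the training data.
tools = [{
    "name": "get_weather",
    "description": "Get the current weather for a city",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}]

def chatml(role: str, content: str) -> str:
    """Wrap one conversation turn in ChatML delimiters."""
    return f"<|im_start|>{role}\n{content}<|im_end|>\n"

# The system turn advertises available tools inside <tools> tags,
# per the Hermes function-calling convention.
system = (
    "You are a function-calling assistant. You may call these tools:\n"
    f"<tools>{json.dumps(tools)}</tools>"
)

prompt = (
    chatml("system", system)
    + chatml("user", "What's the weather in Oslo?")  # dataset role `human` -> `user`
    + "<|im_start|>assistant\n"                      # dataset role `gpt` -> `assistant`
)
print(prompt)
# The model is then expected to emit a <tool_call> block such as:
# <tool_call>{"name": "get_weather", "arguments": {"city": "Oslo"}}</tool_call>
```

Inference frontends that apply the model's chat template handle this formatting automatically; the sketch is only for raw-prompt use or debugging.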
### Available Quants

| Filename | Quant | Type | Size |
|----------|-------|------|------|
| hermes-qwen3.5-35b-a3b-f16.gguf | F16 | Full precision | 64.6 GB |
| hermes-qwen3.5-35b-a3b-Q8_0.gguf | Q8_0 | Standard | 36.9 GB |
| hermes-qwen3.5-35b-a3b-Q6_K.gguf | Q6_K | K-quant | 28.5 GB |
| hermes-qwen3.5-35b-a3b-Q5_K_M.gguf | Q5_K_M | K-quant | 24.7 GB |
| hermes-qwen3.5-35b-a3b-Q5_K_S.gguf | Q5_K_S | K-quant | 24.0 GB |
| hermes-qwen3.5-35b-a3b-Q4_K_M.gguf | Q4_K_M | K-quant | 21.2 GB |
| hermes-qwen3.5-35b-a3b-Q4_K_S.gguf | Q4_K_S | K-quant | 19.9 GB |
| hermes-qwen3.5-35b-a3b-IQ4_NL.gguf | IQ4_NL | imatrix | 19.8 GB |
| hermes-qwen3.5-35b-a3b-IQ4_XS.gguf | IQ4_XS | imatrix | 18.7 GB |
| hermes-qwen3.5-35b-a3b-Q3_K_M.gguf | Q3_K_M | K-quant | 16.8 GB |
| hermes-qwen3.5-35b-a3b-IQ3_M.gguf | IQ3_M | imatrix | 15.4 GB |
| hermes-qwen3.5-35b-a3b-IQ3_S.gguf | IQ3_S | imatrix | 15.3 GB |
| hermes-qwen3.5-35b-a3b-Q3_K_S.gguf | Q3_K_S | K-quant | 15.2 GB |
| hermes-qwen3.5-35b-a3b-IQ3_XXS.gguf | IQ3_XXS | imatrix | 13.6 GB |
| hermes-qwen3.5-35b-a3b-IQ2_M.gguf | IQ2_M | imatrix | 11.7 GB |
| hermes-qwen3.5-35b-a3b-IQ2_S.gguf | IQ2_S | imatrix | 10.7 GB |
| hermes-qwen3.5-35b-a3b-IQ2_XXS.gguf | IQ2_XXS | imatrix | 9.5 GB |
| hermes-qwen3.5-35b-a3b-IQ1_M.gguf | IQ1_M | imatrix | 8.2 GB |
| hermes-qwen3.5-35b-a3b-IQ1_S.gguf | IQ1_S | imatrix | 7.5 GB |

All quantizations verified: 733 tensors, GGUF v3.

### Choosing a Quant

- **Q8_0** (36.9 GB): Closest to full precision. Use if you have the VRAM/RAM.
- **Q6_K / Q5_K_M** (28.5 / 24.7 GB): Good balance of quality and size for most use cases.
- **Q4_K_M** (21.2 GB): Popular sweet spot — significant size reduction with minimal quality loss.
- **IQ4_NL / IQ4_XS** (19.8 / 18.7 GB): Importance-matrix 4-bit — can outperform standard Q4 quants at similar size.
- **IQ3_M / IQ3_S** (15.4 / 15.3 GB): Importance-matrix 3-bit — good quality for the size with imatrix calibration.
- **IQ2_M and below** (11.7 GB and smaller): Extreme compression with imatrix, for constrained environments. Quality degrades progressively.
- **IQ1_M / IQ1_S** (8.2 / 7.5 GB): Maximum compression. Expect significant quality loss.

## Usage

### llama.cpp

```bash
llama-cli -m hermes-qwen3.5-35b-a3b-Q4_K_M.gguf -p "You are a helpful assistant." -cnv
```

### LM Studio / Ollama / KoboldCpp

Download any GGUF file and load it directly.

## Credits

- **Base Model:** [Qwen Team](https://huggingface.co/Qwen)
- **Training Dataset:** [NousResearch](https://huggingface.co/NousResearch)
- **Fine-Tuning Framework:** [Unsloth](https://github.com/unslothai/unsloth)
- **Quantization Tooling:** [llama.cpp](https://github.com/ggerganov/llama.cpp)