Ornstein-hermes-3.6-27b
A Hermes-format function-calling fine-tune of Ornstein-3.6-27B, built on top of Qwen 3.6 27B's hybrid linear + full attention multimodal architecture. Trained on the top-quality Hermes-format slice of DJLougen/Acta, a curated 8-factor agentic tool-use dataset.
GGUF quantizations available at GestaltLabs/Ornstein-Hermes-3.6-27b-GGUF — Q8_0 down through aggressive 2-bit I-quants.
Support This Work
I'm a PhD student in visual neuroscience at the University of Toronto who also happens to spend way too much time fine-tuning, merging, and quantizing open-weight models on rented H100s and a local DGX Spark. All training compute is self-funded, which means balancing GPU costs against a student budget. If my uploads have been useful to you, consider buying a PhD student a coffee. It goes a long way toward keeping these experiments running.
Intended Use
This model is specialized for Hermes-format function calling and agentic tool use:
- `<tools>[...]</tools>` for system-prompt tool registration
- `<think>...</think>` for reasoning blocks before tool selection
- `<tool_call>{...}</tool_call>` for structured function-call emission
- `<tool_response>...</tool_response>` for multi-turn tool-result handling
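To make the tag flow concrete, here is an illustrative assistant turn (hand-written, not model output) assuming a get_weather tool has been registered in the system prompt, followed by the tool result the runtime would feed back on the next turn:

<think>
The user wants the current weather for Tokyo, and the registered get_weather
tool takes a city name, so I should call it with city="Tokyo".
</think>
<tool_call>
{"name": "get_weather", "arguments": {"city": "Tokyo"}}
</tool_call>

<tool_response>
{"city": "Tokyo", "temp_c": 18, "conditions": "partly cloudy"}
</tool_response>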
The base Ornstein-3.6-27B's vision encoder is preserved — image + text inputs still work — but training data was text-only, so vision-language behavior is inherited from the base, not tuned.
Training Details
| | |
|---|---|
| Base model | GestaltLabs/Ornstein-3.6-27B |
| Dataset | DJLougen/Acta — hermes_reasoning ∪ hermes_multiturn, filtered to quality_score >= 0.85 |
| Training rows | 1,559 train / 153 validation |
| Method | LoRA (r=32, α=32, dropout=0.0) on language layers only — attention + MLP linears |
| Trainable params | 233.5M (0.85% of 27.6B) |
| Precision | bf16 (no quantization during training) |
| Sequence length | 4,096 tokens (dataset p99 ≈ 2,400) |
| Optimizer | AdamW, lr 1e-4, cosine schedule, 3% warmup |
| Effective batch size | 16 (per-device 2 × grad-accum 8) |
| Epochs | 2 (196 optimizer steps) |
| Hardware | 1× NVIDIA RTX PRO 6000 Blackwell (97 GB) |
| Wall time | ≈ 2h 03m |
| Framework | Unsloth 2026.4.8 + TRL 0.24 + PEFT 0.19 |
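In PEFT terms, the adapter row above corresponds roughly to the config below. This is a sketch using plain peft rather than the Unsloth wrapper used in training, and the target-module names are assumed Qwen-style linears, not confirmed by this card:

from peft import LoraConfig

# Mirrors the table: r=32, alpha=32, no dropout, language layers only.
# Target-module names are assumed Qwen-style attention + MLP linears.
lora_config = LoraConfig(
    r=32,
    lora_alpha=32,
    lora_dropout=0.0,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",   # attention linears
        "gate_proj", "up_proj", "down_proj",      # MLP linears
    ],
)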
Eval Loss Trajectory
Evaluated every 25 steps on a held-out 153-sample Hermes validation set:
| step | eval_loss |
|---|---|
| 25 | 0.4809 |
| 50 | 0.4433 |
| 75 | 0.4269 |
| 100 | 0.4137 |
| 125 | 0.4043 |
| 150 | 0.3991 |
| 175 | 0.3957 |
| 200 | 0.3953 ← best |
Loss descends monotonically across all 8 evaluation points, with no overfitting signal. Training used load_best_model_at_end=True, so the saved adapter (and this merge) is the best-eval checkpoint.
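For anyone reproducing the schedule, the cadence and best-checkpoint behavior map to roughly the following transformers.TrainingArguments. The hyperparameter values come from the tables above; output_dir and the save cadence are illustrative assumptions, and training actually ran through the Unsloth + TRL stack:

from transformers import TrainingArguments

# Sketch only: values taken from the tables above; output_dir and the
# save cadence are assumptions, not published settings.
args = TrainingArguments(
    output_dir="ornstein-hermes-lora",     # hypothetical path
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,         # effective batch size 16
    learning_rate=1e-4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
    num_train_epochs=2,
    bf16=True,
    eval_strategy="steps",
    eval_steps=25,
    save_strategy="steps",
    save_steps=25,
    load_best_model_at_end=True,           # restores the best-eval checkpoint
    metric_for_best_model="eval_loss",
)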
Why this filter?
The Acta dataset mixes 5 source corpora at varying quality. The Hermes subsets (hermes_reasoning, hermes_multiturn) ship with the canonical role layout (system / user / assistant / tool) that Qwen's chat template handles natively, and have the highest mean quality scores in the dataset (0.902 / 0.833 respectively). The other sources (flamefox_agentic, smolagents_code, swe_agent_glm) either use atypical role assignment or score below 0.85 and were excluded to keep the training distribution clean.
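A sketch of the filter with the datasets library. The subset names and the quality_score >= 0.85 threshold come from the card above; the single-split layout with `source` and `quality_score` columns is my assumption about Acta's schema:

from datasets import load_dataset

# Assumed layout: one split with `source` and `quality_score` columns;
# Acta's actual schema may differ, so treat this as a sketch.
ds = load_dataset("DJLougen/Acta", split="train")
keep = {"hermes_reasoning", "hermes_multiturn"}
train = ds.filter(lambda r: r["source"] in keep and r["quality_score"] >= 0.85)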
Usage
Transformers
from transformers import AutoModelForImageTextToText, AutoProcessor

model_id = "GestaltLabs/Ornstein-hermes-3.6-27b"
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForImageTextToText.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto", trust_remote_code=True
)

# Tools are declared as JSON Schema; the chat template renders them into
# the <tools>...</tools> block of the system prompt.
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
]

messages = [{"role": "user", "content": "What's the weather in Tokyo?"}]

inputs = processor.apply_chat_template(
    messages, tools=tools, add_generation_prompt=True,
    tokenize=True, return_tensors="pt"
).to(model.device)

out = model.generate(**inputs, max_new_tokens=512)

# Decode only the newly generated tokens, skipping the prompt.
print(processor.batch_decode(out[:, inputs["input_ids"].shape[-1]:], skip_special_tokens=True)[0])
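To close the agentic loop, parse the emitted `<tool_call>` payload, execute the function, and append the result as a tool turn before generating again. A minimal sketch; the regex extraction and the run_tool dispatcher are illustrative, not part of this repo:

import json, re

def extract_tool_calls(text):
    """Pull the JSON payloads out of <tool_call>...</tool_call> blocks."""
    return [json.loads(m) for m in
            re.findall(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", text, re.DOTALL)]

reply = processor.batch_decode(
    out[:, inputs["input_ids"].shape[-1]:], skip_special_tokens=True
)[0]
messages.append({"role": "assistant", "content": reply})
for call in extract_tool_calls(reply):
    result = run_tool(call["name"], call["arguments"])  # run_tool: your dispatcher (hypothetical)
    messages.append({"role": "tool", "content": json.dumps(result)})
# Re-apply the chat template to `messages` and generate again for the final answer.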
llama.cpp / Ollama / LM Studio (via GGUF)
See the GGUF repo — pick a quant that fits your memory:
- Q4_K_M — strong default for 24 GB cards
- Q5_K_M / Q6_K — higher fidelity if you have 32–48 GB
- Q8_0 — near-lossless, ~28 GB
- IQ4_NL / IQ4_XS — imatrix-calibrated 4-bit, smaller than Q4_K_M
- IQ3_M / IQ2_M — aggressive but usable for tight VRAM budgets
Architecture
Inherited unchanged from Ornstein-3.6-27B:
- Qwen3_5ForConditionalGeneration — Qwen 3.6 dense with linear + full attention interleaved (Gated Delta Net) + vision encoder
- ~27B parameters total (dense, multimodal)
- Hidden size 5,120 / 64 layers
- Attention: 24 heads, 4 KV heads, head_dim 256, full-attention every 4 layers (linear otherwise; see the sketch below)
- Context length: 262,144 tokens
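Read concretely, one full-attention layer per block of four over 64 layers gives 16 full-attention layers. A toy sketch of the pattern; the exact offset of the full-attention layer within each group of four is my assumption, so check the released config's layer types for the authoritative layout:

# Illustrative: 64 layers, one full-attention layer per group of four.
# The offset within each group is assumed, not confirmed by this card.
layer_types = ["full_attention" if (i + 1) % 4 == 0 else "linear_attention"
               for i in range(64)]
assert layer_types.count("full_attention") == 16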
License
Apache 2.0 — inherited from the Qwen 3.6 base release. Acta is also Apache 2.0.
Citation
If you use this model, please consider citing the dataset:
@dataset{lougen_acta_2026,
  author = {DJLougen},
  title  = {Acta: A Premium Curated Sample of High-Quality Agentic Tool-Use Conversations},
  year   = {2026},
  url    = {https://huggingface.co/datasets/DJLougen/Acta}
}