# Dhara-250M-AR-Base
A 250M parameter autoregressive language model with Canon layers and 32K context length, designed as the base model for the Dhara masked diffusion LLM family.
## Table of Contents
- Model Description
- Architecture
- Training Data
- Training Details
- Benchmark Results
- Context Extension Results
- Usage
- YaRN Context Extension
- Key Insights
- Limitations
- Related Work
- Citation
## Model Description
Dhara-250M-AR-Base is an autoregressive language model built on the LLaMA3 architecture, enhanced with Canon layers: causal 1D depthwise convolutions placed at all four positions (A, B, C, D) within each transformer layer. The model was pretrained on 10.2 billion tokens and extended to 32K context length via progressive RoPE scaling, with built-in YaRN support for inference-time extension to 64K-128K.
This model serves as the AR base for the Dhara diffusion LLM pipeline, where the pretrained weights are transferred to a masked diffusion model (MDM) for parallel text generation.
### Key Features
- Canon Layers (ABCD): O(n) depthwise convolutions at 4 positions per layer for enhanced local context mixing with only 0.26% parameter overhead
- 32K Native Context: Extended from 4K via progressive RoPE scaling (3 stages)
- YaRN Support: Built-in inference-time context extension to 64K-128K without retraining
- GQA: Grouped Query Attention (12 heads, 4 KV heads) for efficient inference
- QK Normalization: RMSNorm on Q/K after RoPE for training stability
- Logit Softcapping: Prevents logit explosion during training
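Logit softcapping squashes each logit through a scaled tanh so its magnitude can never exceed a fixed cap. A minimal sketch of the idea; the cap value of 30.0 is an illustrative assumption, as this card does not state the threshold the model actually uses:

```python
import math

def softcap(logit: float, cap: float = 30.0) -> float:
    """Soft-cap a logit as cap * tanh(logit / cap).

    NOTE: cap=30.0 is an assumed example value, not the model's
    documented setting. Small logits pass through nearly unchanged;
    large ones saturate smoothly at +/- cap instead of exploding.
    """
    return cap * math.tanh(logit / cap)

print(softcap(0.5))  # nearly unchanged
print(softcap(1e6))  # saturates near 30.0
```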
## Architecture
| Specification | Value |
|---|---|
| Parameters | 249.2M |
| Layers | 32 |
| Hidden Size | 768 |
| FF Dimension | 2,176 |
| Attention Heads | 12 |
| KV Heads | 4 (GQA) |
| Head Dimension | 64 |
| Context Length | 32,768 tokens (extendable to 128K via YaRN) |
| RoPE Theta | 8,000,000 |
| Position Encoding | RoPE with YaRN scaling support |
| Normalization | RMSNorm |
| Activation | SiLU (SwiGLU MLP) |
| Vocabulary | 49,152 tokens (custom BPE) |
| Tied Embeddings | Yes |
| Canon Positions | A, B, C, D (all 4) |
| Canon Kernel Size | 4 |
| Canon Parameters | 638,976 (0.26% of total) |
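The 638,976 Canon parameter count can be reproduced from the table above, assuming each Canon op is a bias-free depthwise convolution with kernel size 4, applied once per position per layer, with Canon-B acting separately on the Q, K, and V projections (Q width = 12 heads x 64, KV width = 4 heads x 64):

```python
layers, kernel = 32, 4
hidden = 768                 # Canon-A and Canon-C operate on the hidden stream
q_dim = 12 * 64              # 12 query heads x head dim 64 = 768
kv_dim = 4 * 64              # 4 KV heads x head dim 64 = 256
ff_dim = 2176                # Canon-D operates on the gate*up product

per_layer = (
    hidden * kernel                       # A: before attention
    + (q_dim + kv_dim + kv_dim) * kernel  # B: separately on Q, K, V
    + hidden * kernel                     # C: before MLP
    + ff_dim * kernel                     # D: inside MLP
)
total = per_layer * layers
print(total)  # 638976, i.e. ~0.26% of 249.2M
```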
### Canon Layer Positions
Based on "Physics of Language Models: Part 4.1" by Zeyuan Allen-Zhu:
- Position A: After input LayerNorm, before attention
- Position B: Inside attention, applied separately to Q/K/V after linear projections
- Position C: After post-attention LayerNorm, before MLP
- Position D: Inside MLP, after gate*up product
Each Canon layer is a causal depthwise 1D convolution with residual connection, adding local sequential context mixing at O(n) cost.
### Layer Flow

```
x → LayerNorm → [Canon-A] → Attention([Canon-B on Q,K,V]) → + residual
x → LayerNorm → [Canon-C] → MLP([Canon-D]) → + residual
```
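In plain terms, each output position mixes only itself and the previous kernel-1 positions. A minimal pure-Python sketch of a causal depthwise convolution with a residual add (illustrative only, not the model's actual implementation):

```python
def canon(x, w):
    """Causal depthwise 1D convolution with residual connection (sketch).

    x: list of seq_len rows, each a list of `channels` floats.
    w: list of `kernel` rows of per-channel weights.
    Position t sees only positions t-kernel+1 .. t (zero left-padding),
    so the op is causal and costs O(seq_len) per channel.
    """
    k, ch = len(w), len(x[0])
    padded = [[0.0] * ch for _ in range(k - 1)] + x  # left-pad for causality
    out = []
    for t in range(len(x)):
        window = padded[t:t + k]
        conv = [sum(window[i][c] * w[i][c] for i in range(k)) for c in range(ch)]
        out.append([x[t][c] + conv[c] for c in range(ch)])  # residual add
    return out

# 6-token, 2-channel toy example with kernel size 4
x = [[float(t), float(-t)] for t in range(6)]
w = [[0.1, 0.2]] * 4
y = canon(x, w)
print(y[1])
```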
## Training Data
### Stage 1: Pretraining (10.2B tokens)
- Dataset: codelion/sutra-10B
- Tokens: 10.2 billion
- Context Length: 4,096 tokens
- Content: Diverse English text across science, technology, mathematics, social studies, arts, and language arts
- Domain Distribution: interdisciplinary (35%), technology (21%), science (14%), social studies (8%), mathematics (8%), life skills (5%), arts (4%), language arts (2%)
### Stage 2: Long Context Extension (900M tokens)
Progressive RoPE scaling across 3 phases using long-document data:
| Phase | Context | RoPE Theta | Tokens | Dataset |
|---|---|---|---|---|
| 1: 4K → 8K | 8,192 | 500,000 | 200M | allenai/dolma3_longmino_mix-50B-1025 (lc_synth) |
| 2: 8K → 16K | 16,384 | 2,000,000 | 200M | allenai/dolma3_longmino_mix-50B-1025 (lc_synth) |
| 3: 16K → 32K | 32,768 | 8,000,000 | 500M | allenai/dolma3_longmino_mix-50B-1025 (lc_synth) |
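One common intuition behind the theta schedule: raising theta stretches the wavelengths of the rotary frequencies, so the slowest frequency keeps rotating well beyond the new, longer context window. A rough check using the head dimension of 64 from the architecture table (a sketch of standard RoPE math, not code from this repo):

```python
import math

def longest_wavelength(theta: float, head_dim: int = 64) -> float:
    """Wavelength (in tokens) of the slowest RoPE frequency.

    RoPE rotates dimension pair i at angular frequency theta**(-2*i/head_dim);
    the slowest pair (i = head_dim/2 - 1) has wavelength
    2*pi / theta**(-(head_dim - 2)/head_dim).
    """
    slowest_freq = theta ** (-(head_dim - 2) / head_dim)
    return 2 * math.pi / slowest_freq

# One row per phase of the context-extension table above
for ctx, theta in [(8_192, 500_000), (16_384, 2_000_000), (32_768, 8_000_000)]:
    wl = longest_wavelength(theta)
    print(f"ctx={ctx:>6}  theta={theta:>9,}  longest wavelength ~ {wl:,.0f} tokens")
```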
## Training Details
| Parameter | Pretraining | Context Extension |
|---|---|---|
| Tokens | 10.2B | 900M (3 phases) |
| Batch Size | 64 effective | 16 effective |
| Learning Rate | 3e-4 → 3e-5 (cosine) | 3e-5 → 1e-5 (cosine) |
| Warmup | 389 steps (1%) | 20 steps per phase |
| Optimizer | AdamW (betas=0.9, 0.95) | AdamW (betas=0.9, 0.95) |
| Weight Decay | 0.1 | 0.1 |
| Precision | BF16 | BF16 |
| Hardware | Single NVIDIA RTX PRO 6000 Blackwell (96GB) | Same |
| Pretraining Time | ~48 hours | ~6.5 hours total |
| Throughput | ~60,000 tokens/sec | 33,000-55,000 tokens/sec |
## Benchmark Results
Evaluated using lm-evaluation-harness v0.4.11 (0-shot unless noted):
| Benchmark | Before Context Ext. | After Context Ext. | Delta |
|---|---|---|---|
| PIQA | 57.78% | 57.40% | -0.38 |
| WinoGrande | 48.78% | 51.30% | +2.52 |
| TruthfulQA (mc2) | 49.79% | 50.13% | +0.34 |
| BoolQ | 37.83% | 37.83% | 0.00 |
| OpenBookQA | 31.60% | 32.40% | +0.80 |
| ARC-Easy | 30.30% | 30.18% | -0.12 |
| HellaSwag | 27.95% | 27.16% | -0.79 |
| ARC-Challenge | 26.62% | 25.51% | -1.11 |
| MMLU (5-shot) | 22.95% | 22.95% | 0.00 |
| SciQ | 22.00% | 21.30% | -0.70 |
| Average | 35.56% | 35.62% | +0.06 |
Context extension to 32K preserved short-context benchmark performance with negligible change.
## Context Extension Results
Perplexity measured on held-out PG19 (Project Gutenberg) long documents:
| Context Length | Before (4K trained) | After (32K extended) | Improvement |
|---|---|---|---|
| 4,096 | 33.03 | 35.65 | ~same |
| 16,384 | 163.64 | 38.60 | 4.2x better |
| 32,768 | 702.64 | 41.57 | 16.9x better |
Before context extension, perplexity exploded beyond 4K (33 → 703 at 32K). After extension, perplexity stays nearly flat across all lengths (36 → 42), confirming the model can effectively use the full 32K context.
## Usage
### Basic Usage

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("codelion/dhara-250m-ar-base", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "codelion/dhara-250m-ar-base",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
)
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

# Generate text
prompt = "The most important discovery in physics was"
inputs = tokenizer(prompt, return_tensors="pt").to(device)
outputs = model.generate(
    inputs.input_ids,
    max_new_tokens=100,
    temperature=0.7,
    top_p=0.9,
    repetition_penalty=1.2,
    do_sample=True,
    pad_token_id=tokenizer.pad_token_id,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
### Long Context Usage (32K)

```python
# The model natively supports 32K context
long_text = "..."  # up to 32K tokens
inputs = tokenizer(long_text, return_tensors="pt", truncation=True, max_length=32768).to(device)
outputs = model(inputs.input_ids)
```
## YaRN Context Extension
The model includes built-in YaRN (Yet another RoPE extensioN) support for inference-time context extension beyond 32K without retraining:
```python
import torch
from transformers import AutoModelForCausalLM

# Extend to 64K (2x)
model = AutoModelForCausalLM.from_pretrained(
    "codelion/dhara-250m-ar-base",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    rope_scaling={"type": "yarn", "factor": 2.0, "original_max_position_embeddings": 32768},
)

# Extend to 128K (4x)
model = AutoModelForCausalLM.from_pretrained(
    "codelion/dhara-250m-ar-base",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    rope_scaling={"type": "yarn", "factor": 4.0, "original_max_position_embeddings": 32768},
)
```
YaRN splits the RoPE frequency dimensions into three groups: high frequencies (encoding local positions) are preserved, low frequencies (encoding global positions) are interpolated, and an intermediate band is blended between the two regimes by a ramp, giving better quality than naive position interpolation.
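A sketch of that per-dimension scaling. The ramp thresholds (beta_fast=32, beta_slow=1) are the YaRN paper's defaults, not values stated in this card, and for simplicity the ramp here is linear in rotation count rather than in dimension index:

```python
import math

def yarn_freq_scale(theta=8_000_000.0, head_dim=64, factor=2.0,
                    orig_ctx=32_768, beta_fast=32, beta_slow=1):
    """Per-dimension RoPE frequency scaling in the spirit of YaRN (sketch).

    Dimensions that complete many rotations over the original context
    (high frequency, local info) keep their frequency (scale 1.0);
    dimensions that complete few rotations (low frequency, global info)
    are fully interpolated (scale 1/factor); a ramp blends in between.
    """
    scales = []
    for i in range(head_dim // 2):
        freq = theta ** (-2 * i / head_dim)
        rotations = orig_ctx * freq / (2 * math.pi)  # full turns over orig_ctx
        # ramp = 0 -> keep frequency; ramp = 1 -> divide frequency by factor
        ramp = min(1.0, max(0.0, (beta_fast - rotations) / (beta_fast - beta_slow)))
        scales.append((1 - ramp) + ramp / factor)
    return scales

scales = yarn_freq_scale()
print(scales[0], scales[-1])  # 1.0 0.5 (fastest dim kept, slowest halved)
```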
## Key Insights
**Canon Layers Add Local Context Cheaply**: All 4 Canon positions (ABCD) add depthwise causal convolutions throughout the model with only 0.26% parameter overhead, enhancing local sequential pattern recognition.

**Progressive RoPE Extension Works**: Three-stage context extension (4K → 8K → 16K → 32K) with increasing theta preserved short-context quality while enabling flat perplexity out to 32K.

**AR as Foundation for MDM**: This model is designed as a weight donor for masked diffusion model conversion. The embeddings, MLPs, and Canon layers transfer directly; only the attention patterns need retraining for bidirectional denoising.

**Efficient Training**: The full pipeline (10.2B-token pretraining + 900M-token context extension) completed in ~55 hours on a single GPU.
## Limitations
- This is a base model without instruction tuning; it generates text continuations, not answers
- Performance is limited by the 250M parameter scale and 10.2B token training budget
- The tokenizer is English-focused; non-Latin scripts fall back to byte-level encoding
- Long-context quality depends on having relevant content filling the context window
## Related Work
- Scaling Pedagogical Pretraining to 10 Billion Tokens - prior work on scaling pretraining data
- The Optimal Architecture for Small Language Models - architecture design principles used in this model
- Dhara-70M - our 70M diffusion language model
- Physics of Language Models: Part 4.1 - Canon layers (Zeyuan Allen-Zhu)
## Contact
For questions or feedback, please open a discussion on the Hugging Face discussions page.