# Dhara-250M-AR-Base
A 250M parameter autoregressive language model with Canon layers and 32K context length, designed as the base model for the Dhara masked diffusion LLM family.
## Table of Contents
- Model Description
- Architecture
- Training Data
- Training Details
- Benchmark Results
- Context Extension Results
- Usage
- YaRN Context Extension
- Key Insights
- Limitations
- Related Work
- Citation
## Model Description
Dhara-250M-AR-Base is an autoregressive language model built on the LLaMA3 architecture, enhanced with Canon layers: causal 1D depthwise convolutions placed at all four positions (A, B, C, D) within each transformer layer. The model was pretrained on 10.2 billion tokens and extended to 32K context length via progressive RoPE scaling, with built-in YaRN support for inference-time extension to 64K-128K.
This model serves as the AR base for the Dhara diffusion LLM pipeline, where the pretrained weights are transferred to a masked diffusion model (MDM) for parallel text generation.
### Key Features
- Canon Layers (ABCD): O(n) depthwise convolutions at 4 positions per layer for enhanced local context mixing with only 0.26% parameter overhead
- 32K Native Context: Extended from 4K via progressive RoPE scaling (3 stages)
- YaRN Support: Built-in inference-time context extension to 64K-128K without retraining
- GQA: Grouped Query Attention (12 heads, 4 KV heads) for efficient inference
- QK Normalization: RMSNorm on Q/K after RoPE for training stability
- Logit Softcapping: Prevents logit explosion during training
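Logit softcapping squashes each logit through a scaled tanh so its magnitude can never exceed a fixed cap. A minimal sketch of the idea; the cap value of 30.0 is an illustrative assumption, as this card does not state the threshold the model actually uses:

```python
import math

def softcap(logit: float, cap: float = 30.0) -> float:
    """Soft-cap a logit as cap * tanh(logit / cap).

    NOTE: cap=30.0 is an assumed example value, not the model's
    documented setting. Small logits pass through nearly unchanged;
    large ones saturate smoothly at +/- cap instead of exploding.
    """
    return cap * math.tanh(logit / cap)

print(softcap(0.5))  # nearly unchanged
print(softcap(1e6))  # saturates near 30.0
```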
## Architecture
| Specification | Value |
|---|---|
| Parameters | 249.2M |
| Layers | 32 |
| Hidden Size | 768 |
| FF Dimension | 2,176 |
| Attention Heads | 12 |
| KV Heads | 4 (GQA) |
| Head Dimension | 64 |
| Context Length | 32,768 tokens (extendable to 128K via YaRN) |
| RoPE Theta | 8,000,000 |
| Position Encoding | RoPE with YaRN scaling support |
| Normalization | RMSNorm |
| Activation | SiLU (SwiGLU MLP) |
| Vocabulary | 49,152 tokens (custom BPE) |
| Tied Embeddings | Yes |
| Canon Positions | A, B, C, D (all 4) |
| Canon Kernel Size | 4 |
| Canon Parameters | 638,976 (0.26% of total) |
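The 638,976 Canon parameter count can be reproduced from the table above, assuming each Canon op is a bias-free depthwise convolution with kernel size 4, applied once per position per layer, with Canon-B acting separately on the Q, K, and V projections (Q width = 12 heads x 64, KV width = 4 heads x 64):

```python
layers, kernel = 32, 4
hidden = 768                 # Canon-A and Canon-C operate on the hidden stream
q_dim = 12 * 64              # 12 query heads x head dim 64 = 768
kv_dim = 4 * 64              # 4 KV heads x head dim 64 = 256
ff_dim = 2176                # Canon-D operates on the gate*up product

per_layer = (
    hidden * kernel                       # A: before attention
    + (q_dim + kv_dim + kv_dim) * kernel  # B: separately on Q, K, V
    + hidden * kernel                     # C: before MLP
    + ff_dim * kernel                     # D: inside MLP
)
total = per_layer * layers
print(total)  # 638976, i.e. ~0.26% of 249.2M
```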
### Canon Layer Positions
Based on "Physics of Language Models: Part 4.1" by Zeyuan Allen-Zhu:
- Position A: After input LayerNorm, before attention
- Position B: Inside attention, applied separately to Q/K/V after linear projections
- Position C: After post-attention LayerNorm, before MLP
- Position D: Inside MLP, after gate*up product
Each Canon layer is a causal depthwise 1D convolution with residual connection, adding local sequential context mixing at O(n) cost.
### Layer Flow

```
x → LayerNorm → [Canon-A] → Attention([Canon-B on Q,K,V]) → + residual
x → LayerNorm → [Canon-C] → MLP([Canon-D]) → + residual
```
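In plain terms, each output position mixes only itself and the previous kernel-1 positions. A minimal pure-Python sketch of a causal depthwise convolution with a residual add (illustrative only, not the model's actual implementation):

```python
def canon(x, w):
    """Causal depthwise 1D convolution with residual connection (sketch).

    x: list of seq_len rows, each a list of `channels` floats.
    w: list of `kernel` rows of per-channel weights.
    Position t sees only positions t-kernel+1 .. t (zero left-padding),
    so the op is causal and costs O(seq_len) per channel.
    """
    k, ch = len(w), len(x[0])
    padded = [[0.0] * ch for _ in range(k - 1)] + x  # left-pad for causality
    out = []
    for t in range(len(x)):
        window = padded[t:t + k]
        conv = [sum(window[i][c] * w[i][c] for i in range(k)) for c in range(ch)]
        out.append([x[t][c] + conv[c] for c in range(ch)])  # residual add
    return out

# 6-token, 2-channel toy example with kernel size 4
x = [[float(t), float(-t)] for t in range(6)]
w = [[0.1, 0.2]] * 4
y = canon(x, w)
print(y[1])
```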
## Training Data
### Stage 1: Pretraining (10.2B tokens)
- Dataset: codelion/sutra-10B
- Tokens: 10.2 billion
- Context Length: 4,096 tokens
- Content: Diverse English text across science, technology, mathematics, social studies, arts, and language arts
- Domain Distribution: interdisciplinary (35%), technology (21%), science (14%), social studies (8%), mathematics (8%), life skills (5%), arts (4%), language arts (2%)
### Stage 2: Long Context Extension (900M tokens)
Progressive RoPE scaling across 3 phases using long-document data:
| Phase | Context | RoPE Theta | Tokens | Dataset |
|---|---|---|---|---|
| 1: 4K → 8K | 8,192 | 500,000 | 200M | allenai/dolma3_longmino_mix-50B-1025 (lc_synth) |
| 2: 8K → 16K | 16,384 | 2,000,000 | 200M | allenai/dolma3_longmino_mix-50B-1025 (lc_synth) |
| 3: 16K → 32K | 32,768 | 8,000,000 | 500M | allenai/dolma3_longmino_mix-50B-1025 (lc_synth) |
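One common intuition behind the theta schedule: raising theta stretches the wavelengths of the rotary frequencies, so the slowest frequency keeps rotating well beyond the new, longer context window. A rough check using the head dimension of 64 from the architecture table (a sketch of standard RoPE math, not code from this repo):

```python
import math

def longest_wavelength(theta: float, head_dim: int = 64) -> float:
    """Wavelength (in tokens) of the slowest RoPE frequency.

    RoPE rotates dimension pair i at angular frequency theta**(-2*i/head_dim);
    the slowest pair (i = head_dim/2 - 1) has wavelength
    2*pi / theta**(-(head_dim - 2)/head_dim).
    """
    slowest_freq = theta ** (-(head_dim - 2) / head_dim)
    return 2 * math.pi / slowest_freq

# One row per phase of the context-extension table above
for ctx, theta in [(8_192, 500_000), (16_384, 2_000_000), (32_768, 8_000_000)]:
    wl = longest_wavelength(theta)
    print(f"ctx={ctx:>6}  theta={theta:>9,}  longest wavelength ~ {wl:,.0f} tokens")
```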
## Training Details
| Parameter | Pretraining | Context Extension |
|---|---|---|
| Tokens | 10.2B | 900M (3 phases) |
| Batch Size | 64 effective | 16 effective |
| Learning Rate | 3e-4 → 3e-5 (cosine) | 3e-5 → 1e-5 (cosine) |
| Warmup | 389 steps (1%) | 20 steps per phase |
| Optimizer | AdamW (betas=0.9, 0.95) | AdamW (betas=0.9, 0.95) |
| Weight Decay | 0.1 | 0.1 |
| Precision | BF16 | BF16 |
| Hardware | Single NVIDIA RTX PRO 6000 Blackwell (96GB) | Same |
| Pretraining Time | ~48 hours | ~6.5 hours total |
| Throughput | ~60,000 tokens/sec | 33,000-55,000 tokens/sec |
## Benchmark Results
Evaluated using lm-evaluation-harness v0.4.11 (0-shot unless noted):
| Benchmark | Before Context Ext. | After Context Ext. | Delta |
|---|---|---|---|
| PIQA | 57.78% | 57.40% | -0.38 |
| WinoGrande | 48.78% | 51.30% | +2.52 |
| TruthfulQA (mc2) | 49.79% | 50.13% | +0.34 |
| BoolQ | 37.83% | 37.83% | 0.00 |
| OpenBookQA | 31.60% | 32.40% | +0.80 |
| ARC-Easy | 30.30% | 30.18% | -0.12 |
| HellaSwag | 27.95% | 27.16% | -0.79 |
| ARC-Challenge | 26.62% | 25.51% | -1.11 |
| MMLU (5-shot) | 22.95% | 22.95% | 0.00 |
| SciQ | 22.00% | 21.30% | -0.70 |
| Average | 35.56% | 35.62% | +0.06 |
Context extension to 32K preserved short-context benchmark performance with negligible change.
## Context Extension Results
Perplexity measured on held-out PG19 (Project Gutenberg) long documents:
| Context Length | Before (4K trained) | After (32K extended) | Improvement |
|---|---|---|---|
| 4,096 | 33.03 | 35.65 | ~same |
| 16,384 | 163.64 | 38.60 | 4.2x better |
| 32,768 | 702.64 | 41.57 | 16.9x better |
Before context extension, perplexity exploded beyond 4K (33 → 703 at 32K). After extension, perplexity stays nearly flat across all lengths (36 → 42), confirming the model can effectively use the full 32K context.
## Usage
### Basic Usage

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("codelion/dhara-250m-ar-base", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "codelion/dhara-250m-ar-base",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
)
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

# Generate text
prompt = "The most important discovery in physics was"
inputs = tokenizer(prompt, return_tensors="pt").to(device)
outputs = model.generate(
    inputs.input_ids,
    max_new_tokens=100,
    temperature=0.7,
    top_p=0.9,
    repetition_penalty=1.2,
    do_sample=True,
    pad_token_id=tokenizer.pad_token_id,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
### Long Context Usage (32K)

```python
# The model natively supports 32K context
long_text = "..."  # up to 32K tokens
inputs = tokenizer(long_text, return_tensors="pt", truncation=True, max_length=32768).to(device)
outputs = model(inputs.input_ids)
```
## YaRN Context Extension
The model includes built-in YaRN (Yet another RoPE extensioN) support for inference-time context extension beyond 32K without retraining:
```python
import torch
from transformers import AutoModelForCausalLM

# Extend to 64K (2x)
model = AutoModelForCausalLM.from_pretrained(
    "codelion/dhara-250m-ar-base",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    rope_scaling={"type": "yarn", "factor": 2.0, "original_max_position_embeddings": 32768},
)

# Extend to 128K (4x)
model = AutoModelForCausalLM.from_pretrained(
    "codelion/dhara-250m-ar-base",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    rope_scaling={"type": "yarn", "factor": 4.0, "original_max_position_embeddings": 32768},
)
```
YaRN splits the RoPE frequency dimensions into three groups: high frequencies (encoding local positions) are preserved, low frequencies (encoding global positions) are interpolated, and an intermediate band is blended between the two regimes by a ramp, giving better quality than naive position interpolation.
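A sketch of that per-dimension scaling. The ramp thresholds (beta_fast=32, beta_slow=1) are the YaRN paper's defaults, not values stated in this card, and for simplicity the ramp here is linear in rotation count rather than in dimension index:

```python
import math

def yarn_freq_scale(theta=8_000_000.0, head_dim=64, factor=2.0,
                    orig_ctx=32_768, beta_fast=32, beta_slow=1):
    """Per-dimension RoPE frequency scaling in the spirit of YaRN (sketch).

    Dimensions that complete many rotations over the original context
    (high frequency, local info) keep their frequency (scale 1.0);
    dimensions that complete few rotations (low frequency, global info)
    are fully interpolated (scale 1/factor); a ramp blends in between.
    """
    scales = []
    for i in range(head_dim // 2):
        freq = theta ** (-2 * i / head_dim)
        rotations = orig_ctx * freq / (2 * math.pi)  # full turns over orig_ctx
        # ramp = 0 -> keep frequency; ramp = 1 -> divide frequency by factor
        ramp = min(1.0, max(0.0, (beta_fast - rotations) / (beta_fast - beta_slow)))
        scales.append((1 - ramp) + ramp / factor)
    return scales

scales = yarn_freq_scale()
print(scales[0], scales[-1])  # 1.0 0.5 (fastest dim kept, slowest halved)
```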
## Key Insights
**Canon Layers Add Local Context Cheaply**: All 4 Canon positions (ABCD) add depthwise causal convolutions throughout the model with only 0.26% parameter overhead, enhancing local sequential pattern recognition.

**Progressive RoPE Extension Works**: Three-stage context extension (4K → 8K → 16K → 32K) with increasing theta preserved short-context quality while enabling flat perplexity out to 32K.

**AR as Foundation for MDM**: This model is designed as a weight donor for masked diffusion model conversion. The embeddings, MLPs, and Canon layers transfer directly; only the attention patterns need retraining for bidirectional denoising.

**Efficient Training**: The full pipeline (10.2B-token pretraining + 900M-token context extension) completed in ~55 hours on a single GPU.
## Limitations
- This is a base model without instruction tuning; it generates text continuations, not answers
- Performance is limited by the 250M parameter scale and 10.2B token training budget
- The tokenizer is English-focused; non-Latin scripts fall back to byte-level encoding
- Long-context quality depends on having relevant content filling the context window
## Related Work
- Scaling Pedagogical Pretraining to 10 Billion Tokens - prior work on scaling pretraining data
- The Optimal Architecture for Small Language Models - architecture design principles used in this model
- Dhara-70M - our 70M diffusion language model
- Physics of Language Models: Part 4.1 - Canon layers (Zeyuan Allen-Zhu)
## Contact
For questions or feedback, please open a discussion on the Hugging Face discussions page.