Dhara-250M-AR-Base

A 250M parameter autoregressive language model with Canon layers and 32K context length, designed as the base model for the Dhara masked diffusion LLM family.

Model Description

Dhara-250M-AR-Base is an autoregressive language model built on the LLaMA3 architecture, enhanced with Canon layers: causal 1D depthwise convolutions placed at all 4 positions (ABCD) within each transformer layer. The model was pretrained on 10.2 billion tokens and extended to 32K context length via progressive RoPE scaling, with built-in YaRN support for inference-time extension to 64K-128K.

This model serves as the AR base for the Dhara diffusion LLM pipeline, where the pretrained weights are transferred to a masked diffusion model (MDM) for parallel text generation.

Key Features

  • Canon Layers (ABCD): O(n) depthwise convolutions at 4 positions per layer for enhanced local context mixing with only 0.26% parameter overhead
  • 32K Native Context: Extended from 4K via progressive RoPE scaling (3 stages)
  • YaRN Support: Built-in inference-time context extension to 64K-128K without retraining
  • GQA: Grouped Query Attention (12 heads, 4 KV heads) for efficient inference
  • QK Normalization: RMSNorm on Q/K after RoPE for training stability
  • Logit Softcapping: Prevents logit explosion during training

Architecture

| Specification | Value |
|---|---|
| Parameters | 249.2M |
| Layers | 32 |
| Hidden Size | 768 |
| FF Dimension | 2,176 |
| Attention Heads | 12 |
| KV Heads | 4 (GQA) |
| Head Dimension | 64 |
| Context Length | 32,768 tokens (extendable to 128K via YaRN) |
| RoPE Theta | 8,000,000 |
| Position Encoding | RoPE with YaRN scaling support |
| Normalization | RMSNorm |
| Activation | SiLU (SwiGLU MLP) |
| Vocabulary | 49,152 tokens (custom BPE) |
| Tied Embeddings | Yes |
| Canon Positions | A, B, C, D (all 4) |
| Canon Kernel Size | 4 |
| Canon Parameters | 638,976 (0.26% of total) |
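The Canon parameter count can be reproduced from the dimensions above. This is a back-of-the-envelope check, assuming bias-free depthwise convolutions (which is what the stated total implies):

```python
# Reproduce the Canon parameter count from the architecture table
# (assumes bias-free depthwise convolutions; one kernel weight per channel).
hidden, ffn, kernel, layers = 768, 2176, 4, 32
q_dim = 12 * 64   # attention heads * head dim
kv_dim = 4 * 64   # KV heads * head dim

per_layer = kernel * (
    hidden                 # Position A: before attention
    + q_dim + 2 * kv_dim   # Position B: separate convs on Q, K, V
    + hidden               # Position C: before MLP
    + ffn                  # Position D: after gate*up product
)
total = per_layer * layers
print(total)  # 638976, i.e. ~0.26% of 249.2M
```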

Canon Layer Positions

Based on "Physics of Language Models: Part 4.1" by Zeyuan Allen-Zhu:

  • Position A: After input LayerNorm, before attention
  • Position B: Inside attention, applied separately to Q/K/V after linear projections
  • Position C: After post-attention LayerNorm, before MLP
  • Position D: Inside MLP, after gate*up product

Each Canon layer is a causal depthwise 1D convolution with residual connection, adding local sequential context mixing at O(n) cost.

Layer Flow

x → LayerNorm → [Canon-A] → Attention([Canon-B on Q,K,V]) → + residual
x → LayerNorm → [Canon-C] → MLP([Canon-D]) → + residual
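The Canon building block itself is simple. A minimal PyTorch sketch, not the model's actual implementation (the class name and defaults are illustrative):

```python
import torch
import torch.nn as nn

class CanonLayer(nn.Module):
    """Causal depthwise 1D convolution with a residual connection (sketch)."""

    def __init__(self, dim: int, kernel_size: int = 4):
        super().__init__()
        # groups=dim makes the convolution depthwise: one filter per channel
        self.conv = nn.Conv1d(dim, dim, kernel_size, groups=dim,
                              padding=kernel_size - 1, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, dim); Conv1d expects (batch, dim, seq)
        y = self.conv(x.transpose(1, 2))
        y = y[..., : x.size(1)]  # drop right-side padding -> strictly causal
        return x + y.transpose(1, 2)  # residual connection
```

Each output position mixes only the current token and the 3 before it, so the cost is O(n) in sequence length, in contrast to the O(n^2) of attention.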

Training Data

Stage 1: Pretraining (10.2B tokens)

  • Dataset: codelion/sutra-10B
  • Tokens: 10.2 billion
  • Context Length: 4,096 tokens
  • Content: Diverse English text across science, technology, mathematics, social studies, arts, and language arts
  • Domain Distribution: interdisciplinary (35%), technology (21%), science (14%), social studies (8%), mathematics (8%), life skills (5%), arts (4%), language arts (2%)

Stage 2: Long Context Extension (900M tokens)

Progressive RoPE scaling across 3 phases using long-document data:

| Phase | Context | RoPE Theta | Tokens | Dataset |
|---|---|---|---|---|
| 1: 4K → 8K | 8,192 | 500,000 | 200M | allenai/dolma3_longmino_mix-50B-1025 (lc_synth) |
| 2: 8K → 16K | 16,384 | 2,000,000 | 200M | allenai/dolma3_longmino_mix-50B-1025 (lc_synth) |
| 3: 16K → 32K | 32,768 | 8,000,000 | 500M | allenai/dolma3_longmino_mix-50B-1025 (lc_synth) |
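The theta schedule can be sanity-checked: the slowest-rotating RoPE dimension should have a wavelength longer than each phase's target context, so no position "wraps around". A small sketch using head_dim = 64 from the architecture table:

```python
import math

def max_wavelength(theta: float, head_dim: int = 64) -> float:
    # The slowest RoPE dimension has frequency theta**(-(d-2)/d);
    # its wavelength in tokens is 2*pi divided by that frequency.
    return 2 * math.pi * theta ** ((head_dim - 2) / head_dim)

for ctx, theta in [(8_192, 500_000), (16_384, 2_000_000), (32_768, 8_000_000)]:
    print(f"theta={theta:>9,}: longest wavelength ~{max_wavelength(theta):,.0f} tokens (target ctx {ctx:,})")
```

In each phase the longest wavelength comfortably exceeds the target context, which is why the schedule raises theta as the window grows.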

Training Details

| Parameter | Pretraining | Context Extension |
|---|---|---|
| Tokens | 10.2B | 900M (3 phases) |
| Batch Size | 64 effective | 16 effective |
| Learning Rate | 3e-4 → 3e-5 (cosine) | 3e-5 → 1e-5 (cosine) |
| Warmup | 389 steps (1%) | 20 steps per phase |
| Optimizer | AdamW (betas=0.9, 0.95) | AdamW (betas=0.9, 0.95) |
| Weight Decay | 0.1 | 0.1 |
| Precision | BF16 | BF16 |
| Hardware | Single NVIDIA RTX PRO 6000 Blackwell (96GB) | Same |
| Training Time | ~48 hours | ~6.5 hours total |
| Throughput | ~60,000 tokens/sec | 33,000-55,000 tokens/sec |
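The warmup figure is consistent with the other pretraining entries (10.2B tokens, effective batch 64, 4,096-token sequences); a quick cross-check:

```python
tokens_per_step = 64 * 4096             # effective batch * sequence length
total_steps = 10.2e9 / tokens_per_step  # ~38,910 optimizer steps
warmup = round(0.01 * total_steps)      # 1% warmup
print(round(total_steps), warmup)       # 38910 389
```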

Benchmark Results

Evaluated using lm-evaluation-harness v0.4.11 (0-shot unless noted):

| Benchmark | Before Context Ext. | After Context Ext. | Delta |
|---|---|---|---|
| PIQA | 57.78% | 57.40% | -0.38 |
| WinoGrande | 48.78% | 51.30% | +2.52 |
| TruthfulQA (mc2) | 49.79% | 50.13% | +0.34 |
| BoolQ | 37.83% | 37.83% | 0.00 |
| OpenBookQA | 31.60% | 32.40% | +0.80 |
| ARC-Easy | 30.30% | 30.18% | -0.12 |
| HellaSwag | 27.95% | 27.16% | -0.79 |
| ARC-Challenge | 26.62% | 25.51% | -1.11 |
| MMLU (5-shot) | 22.95% | 22.95% | 0.00 |
| SciQ | 22.00% | 21.30% | -0.70 |
| Average | 35.56% | 35.62% | +0.06 |

Context extension to 32K preserved short-context benchmark performance with negligible change.

Context Extension Results

Perplexity measured on held-out PG19 (Project Gutenberg) long documents:

| Context Length | Before (4K trained) | After (32K extended) | Improvement |
|---|---|---|---|
| 4,096 | 33.03 | 35.65 | slightly worse |
| 16,384 | 163.64 | 38.60 | 4.2x better |
| 32,768 | 702.64 | 41.57 | 16.9x better |

Before context extension, perplexity exploded beyond 4K (33 → 703). After extension, perplexity stays nearly flat across all lengths (36 → 42), confirming the model can effectively use the full 32K context.

Usage

Basic Usage

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("codelion/dhara-250m-ar-base", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "codelion/dhara-250m-ar-base",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
)

device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

# Generate text
prompt = "The most important discovery in physics was"
inputs = tokenizer(prompt, return_tensors="pt").to(device)
outputs = model.generate(
    **inputs,  # passes attention_mask along with input_ids
    max_new_tokens=100,
    temperature=0.7,
    top_p=0.9,
    repetition_penalty=1.2,
    do_sample=True,
    pad_token_id=tokenizer.pad_token_id,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Long Context Usage (32K)

# The model natively supports 32K context
long_text = "..." # up to 32K tokens
inputs = tokenizer(long_text, return_tensors="pt", truncation=True, max_length=32768).to(device)
outputs = model(**inputs)

YaRN Context Extension

The model includes built-in YaRN (Yet another RoPE extensioN) support for inference-time context extension beyond 32K without retraining:

from transformers import AutoModelForCausalLM

# Extend to 64K (2x)
model = AutoModelForCausalLM.from_pretrained(
    "codelion/dhara-250m-ar-base",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    rope_scaling={"type": "yarn", "factor": 2.0, "original_max_position_embeddings": 32768},
)

# Extend to 128K (4x)
model = AutoModelForCausalLM.from_pretrained(
    "codelion/dhara-250m-ar-base",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    rope_scaling={"type": "yarn", "factor": 4.0, "original_max_position_embeddings": 32768},
)

YaRN splits RoPE frequency dimensions into three groups: high frequencies (local positions) are preserved, low frequencies (global positions) are interpolated, and the frequencies in between are blended with a linear ramp, giving better quality than naive position interpolation.
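A simplified sketch of that blending (the function name and beta cutoffs below are illustrative; the actual YaRN method derives the correction range from target rotation counts per dimension, which this linear ramp only approximates):

```python
import math
import torch

def yarn_inv_freq(dim: int = 64, base: float = 8_000_000.0, factor: float = 4.0,
                  orig_ctx: int = 32_768, beta_fast: float = 32.0,
                  beta_slow: float = 1.0) -> torch.Tensor:
    """Blend original and interpolated RoPE frequencies, YaRN-style (sketch)."""
    inv_freq = base ** (-torch.arange(0, dim, 2, dtype=torch.float32) / dim)
    # Rotations each dimension completes over the original context window.
    rotations = orig_ctx * inv_freq / (2 * math.pi)
    # ramp = 0 for fast dims (>= beta_fast rotations: keep as-is),
    # ramp = 1 for slow dims (<= beta_slow rotations: fully interpolate).
    ramp = ((beta_fast - rotations) / (beta_fast - beta_slow)).clamp(0.0, 1.0)
    return inv_freq * (1.0 - ramp) + (inv_freq / factor) * ramp
```

The highest-frequency dimensions come out unchanged (local token order is preserved), while the lowest-frequency dimensions are divided by the scaling factor (global positions are stretched to cover the longer window).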

Key Insights

  1. Canon Layers Add Local Context Cheaply: All 4 Canon positions (ABCD) add depthwise causal convolutions throughout the model with only 0.26% parameter overhead, enhancing local sequential pattern recognition.

  2. Progressive RoPE Extension Works: Three-stage context extension (4K → 8K → 16K → 32K) with increasing theta preserved short-context quality while enabling flat perplexity out to 32K.

  3. AR as Foundation for MDM: This model is designed as a weight donor for masked diffusion model conversion. The embeddings, MLPs, and Canon layers transfer directly; only the attention patterns need retraining for bidirectional denoising.

  4. Efficient Training: The full pipeline (10.2B pretraining + 900M context extension) completed in ~55 hours on a single GPU.

Limitations

  • This is a base model without instruction tuning; it generates text continuations, not answers
  • Performance is limited by the 250M parameter scale and 10.2B token training budget
  • The tokenizer is English-focused; non-Latin scripts fall back to byte-level encoding
  • Long-context quality depends on having relevant content filling the context window

Contact

For questions or feedback, please open a discussion on the Hugging Face discussions page.
