LightOnOCR: A 1B End-to-End Multilingual Vision-Language Model for State-of-the-Art OCR
Paper: arXiv 2601.14251
This model is a fine-tuned version of lightonai/LightOnOCR-2-1B-base specifically trained for line-level OCR of Old Church Slavonic (OCS) manuscripts.
This is a line-level model: it expects cropped line images as input, not full pages. Each image should contain a single line of text.
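Since the model expects single-line crops, pages must be segmented into lines first. A minimal sketch of that preparation step, assuming you already have line bounding boxes from a layout/segmentation tool (the page and coordinates below are placeholders for illustration):

```python
from PIL import Image

# Stand-in for a scanned manuscript page; replace with Image.open("page.jpg")
page = Image.new("RGB", (1200, 1600), "white")

# Hypothetical line bounding boxes as (left, top, right, bottom) tuples,
# e.g. produced by a line-segmentation model
line_boxes = [(50, 100, 1150, 160), (50, 170, 1150, 230)]

# Crop each line into its own image, ready to feed to the OCR model
line_images = [page.crop(box) for box in line_boxes]
for i, line in enumerate(line_images):
    print(i, line.size)  # each crop covers exactly one line of text
```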
Evaluated on 50 samples from the test set:
| Metric | Base Model | Finetuned | Change |
|---|---|---|---|
| CER (%) | 141.34 | 5.04 | −136.30 |
| WER (%) | 112.34 | 27.85 | −84.49 |
| Perfect Matches | 0 | 15 | +15 |

Lower CER/WER is better; a higher perfect-match count is better. (CER above 100% is possible when the hypothesis contains more erroneous characters than the reference has characters.)
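For reference, CER is character-level edit distance divided by reference length. A minimal sketch of that computation using a standard dynamic-programming Levenshtein distance; this is a generic illustration, not the exact evaluation script used for the numbers above:

```python
def levenshtein(ref: str, hyp: str) -> int:
    """Minimum number of character insertions, deletions, and
    substitutions needed to turn `hyp` into `ref`."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (r != h),    # substitution (0 if chars match)
            ))
        prev = curr
    return prev[-1]

def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance over reference length."""
    return levenshtein(ref, hyp) / max(len(ref), 1)

# One substituted character out of five: CER = 20%
print(f"{cer('лоуна', 'лоунъ') * 100:.2f}")
```

WER is the same computation over whitespace-separated tokens instead of characters.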
| # | Ground Truth | Base Model | Finetuned |
|---|---|---|---|
| 1 | дьници въсиꙗють рєчє сл҃нцє ꙗко лоуна∙ ꙗ... | $\text{A} \text{B} \text{C} \text{D} \te... | ✓ дьници въсиꙗють рєчє сл҃нцє ꙗко лоуна∙ ꙗ... |
| 2 | свѣтьлость тврьдьнаѧ∙ и провѣды чл҃чє нє... | СКУТДЛАСТЬТЕРДЕНАД. НПРОВЕДЫ ТЛТНЕЖВЕ | свѣтолость тврьдьнаѧ∙ и провѣды чл҃чє нє... |
| 3 | риѥ б҃ь∙жажєлицємь маломь въ жатвю∙ свѣт... | ОНИКЬ. ЖАЖЕЛНЦЕЛЬДАПОЛЪВЪЖАТБЮ. СЕВЬ | риѥ б҃ь∙ жа жє лицємь маломь въ жатвю∙ с... |
| 4 | лы дасть ꙁарѧ сиѧти ис тѣлєсє∙ да оть ви... | лы дасть ꙁа рѧси ѧти истѣлє сє∙ да оть в... | |
| 5 | ихь вѣроуѥмо боудєть и даѥмоѥ∙ часть бо ... | НЬВЕРОВНЯЛО РОУДЬ ГЪНДА НПЛОЮ. ТАСТЬ БОП... | ихь вѣроуѥмо боудєть и да ѥмоѥ∙ часть бо... |
✓ = exact match
```bash
# Requires transformers from source
pip install git+https://github.com/huggingface/transformers
pip install pillow torch
```
```python
import torch
from transformers import LightOnOcrForConditionalGeneration, LightOnOcrProcessor
from PIL import Image

# Load model and processor
model_id = "wjbmattingly/LightOnOCR-2-1B-old-church-slavonic-line"
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.bfloat16 if device == "cuda" else torch.float32

processor = LightOnOcrProcessor.from_pretrained(model_id)
model = LightOnOcrForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=dtype,
).to(device)

# Load your line image
image = Image.open("your_line_image.jpg").convert("RGB")

# Prepare input
messages = [{"role": "user", "content": [{"type": "image"}]}]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(
    text=[text],
    images=[[image]],
    return_tensors="pt",
    padding=True,
    size={"longest_edge": 700},
).to(device)
inputs["pixel_values"] = inputs["pixel_values"].to(dtype)

# Generate transcription
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=256, do_sample=False)

# Decode only the newly generated tokens
input_length = inputs["input_ids"].shape[1]
generated_ids = outputs[0, input_length:]
transcription = processor.decode(generated_ids, skip_special_tokens=True)
print(transcription)
# Example output: дьници въсиꙗють рєчє сл҃нцє ꙗко лоуна∙ ꙗко
```
```python
from datasets import load_dataset

# Load the first ten lines of the dataset
dataset = load_dataset("wjbmattingly/old-church-slavonic-line", split="train[:10]")

# Build one prompt per image and process the whole batch at once
images = [[img.convert("RGB")] for img in dataset["image"]]
messages = [{"role": "user", "content": [{"type": "image"}]}]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
texts = [text] * len(images)

inputs = processor(
    text=texts,
    images=images,
    return_tensors="pt",
    padding=True,
    size={"longest_edge": 700},
).to(device)
inputs["pixel_values"] = inputs["pixel_values"].to(dtype)

outputs = model.generate(**inputs, max_new_tokens=256, do_sample=False)
predictions = processor.batch_decode(
    outputs[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)

for pred, gt in zip(predictions, dataset["text"]):
    print(f"Prediction: {pred}")
    print(f"Ground Truth: {gt}")
    print()
```
If you use this model, please cite:
```bibtex
@misc{lightonocr2_ocs_2026,
  title        = {LightOnOCR Fine-tuned for Old Church Slavonic},
  author       = {William Mattingly},
  year         = {2026},
  howpublished = {\url{https://huggingface.co/wjbmattingly/LightOnOCR-2-1B-old-church-slavonic-line}}
}
```
And the original LightOnOCR paper:
```bibtex
@misc{lightonocr2_2026,
  title        = {LightOnOCR: A 1B End-to-End Multilingual Vision-Language Model for State-of-the-Art OCR},
  author       = {Said Taghadouini and Adrien Cavaill\`{e}s and Baptiste Aubertin},
  year         = {2026},
  howpublished = {\url{https://arxiv.org/pdf/2601.14251}}
}
```
Base model: lightonai/LightOnOCR-2-1B-base