LightOnOCR-2-1B for Old Church Slavonic (Line-Level)


This model is a fine-tuned version of lightonai/LightOnOCR-2-1B-base specifically trained for line-level OCR of Old Church Slavonic (OCS) manuscripts.

Model Description

This is a line-level model: it expects cropped line images as input, not full pages. Each image should contain a single line of text.

Evaluation Results

Evaluated on 50 samples from the test set:

| Metric | Base Model | Fine-tuned | Improvement |
|---|---|---|---|
| CER (%) | 141.34 | 5.04 | +136.30 |
| WER (%) | 112.34 | 27.85 | +84.49 |
| Perfect Matches | 0 | 15 | +15 |

Lower CER/WER is better. Higher perfect matches is better.
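
CER and WER are edit-distance metrics: the number of character (or word) edits needed to turn the prediction into the reference, divided by the reference length. A minimal sketch of how they can be computed, using a plain Levenshtein implementation rather than the exact evaluation script used for the numbers above:

```python
def levenshtein(a, b):
    # classic dynamic-programming edit distance over two sequences
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def cer(pred, truth):
    # character error rate: character edits / reference length
    return levenshtein(pred, truth) / max(len(truth), 1)

def wer(pred, truth):
    # word error rate: the same distance computed over whitespace tokens
    return levenshtein(pred.split(), truth.split()) / max(len(truth.split()), 1)
```

Note that CER can exceed 100% (as the base model's does here) when the prediction requires more edits than the reference has characters.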

Example Outputs

| # | Ground Truth | Base Model | Fine-tuned |
|---|---|---|---|
| 1 | дьници въсиꙗють рєчє сл҃нцє ꙗко лоуна∙ ꙗ... | $\text{A} \text{B} \text{C} \text{D} \te... | ✓ дьници въсиꙗють рєчє сл҃нцє ꙗко лоуна∙ ꙗ... |
| 2 | свѣтьлость тврьдьнаѧ∙ и провѣды чл҃чє нє... | СКУТДЛАСТЬТЕРДЕНАД. НПРОВЕДЫ ТЛТНЕЖВЕ | свѣтолость тврьдьнаѧ∙ и провѣды чл҃чє нє... |
| 3 | риѥ б҃ь∙жажєлицємь маломь въ жатвю∙ свѣт... | ОНИКЬ. ЖАЖЕЛНЦЕЛЬДАПОЛЪВЪЖАТБЮ. СЕВЬ | риѥ б҃ь∙ жа жє лицємь маломь въ жатвю∙ с... |
| 4 | лы дасть ꙁарѧ сиѧти ис тѣлєсє∙ да оть ви... | | лы дасть ꙁа рѧси ѧти истѣлє сє∙ да оть в... |
| 5 | ихь вѣроуѥмо боудєть и даѥмоѥ∙ часть бо ... | НЬВЕРОВНЯЛО РОУДЬ ГЪНДА НПЛОЮ. ТАСТЬ БОП... | ихь вѣроуѥмо боудєть и да ѥмоѥ∙ часть бо... |

✓ = exact match

Usage

Installation

```shell
# Requires transformers from source
pip install git+https://github.com/huggingface/transformers
pip install pillow torch
```

Python Usage

```python
import torch
from transformers import LightOnOcrForConditionalGeneration, LightOnOcrProcessor
from PIL import Image

# Load model and processor
model_id = "wjbmattingly/LightOnOCR-2-1B-old-church-slavonic-line"
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.bfloat16 if device == "cuda" else torch.float32

processor = LightOnOcrProcessor.from_pretrained(model_id)
model = LightOnOcrForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=dtype,
).to(device)

# Load your line image
image = Image.open("your_line_image.jpg").convert("RGB")

# Prepare input
messages = [{"role": "user", "content": [{"type": "image"}]}]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

inputs = processor(
    text=[text],
    images=[[image]],
    return_tensors="pt",
    padding=True,
    size={"longest_edge": 700},
).to(device)
inputs["pixel_values"] = inputs["pixel_values"].to(dtype)

# Generate transcription
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=256, do_sample=False)

# Decode only the newly generated tokens, skipping the prompt
input_length = inputs["input_ids"].shape[1]
generated_ids = outputs[0, input_length:]
transcription = processor.decode(generated_ids, skip_special_tokens=True)

print(transcription)
# Example output: дьници въсиꙗють рєчє сл҃нцє ꙗко лоуна∙ ꙗко
```

Batch Inference

```python
from datasets import load_dataset

# Load dataset
dataset = load_dataset("wjbmattingly/old-church-slavonic-line", split="train[:10]")

# Process batch
images = [[img.convert("RGB")] for img in dataset["image"]]
messages = [{"role": "user", "content": [{"type": "image"}]}]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
texts = [text] * len(images)

inputs = processor(
    text=texts,
    images=images,
    return_tensors="pt",
    padding=True,
    size={"longest_edge": 700},
).to(device)
inputs["pixel_values"] = inputs["pixel_values"].to(dtype)

outputs = model.generate(**inputs, max_new_tokens=256, do_sample=False)
predictions = processor.batch_decode(outputs[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True)

for pred, gt in zip(predictions, dataset["text"]):
    print(f"Prediction: {pred}")
    print(f"Ground Truth: {gt}")
    print()
```
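
The "Perfect Matches" figure in the evaluation table corresponds to exact-string agreement between prediction and reference. A sketch of counting it over such a prediction loop (trimming surrounding whitespace first is an assumption here; the evaluation script may normalize differently):

```python
def count_perfect_matches(predictions, references):
    # count predictions that equal their reference exactly after trimming
    return sum(p.strip() == r.strip() for p, r in zip(predictions, references))

# toy example with one deliberate mismatch in the middle
preds = ["дьници въсиꙗють", "свѣтолость", "риѥ б҃ь"]
refs = ["дьници въсиꙗють", "свѣтьлость", "риѥ б҃ь"]
matches = count_perfect_matches(preds, refs)  # 2 of 3 lines match exactly
```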

Training Details

  • Base Model: lightonai/LightOnOCR-2-1B-base
  • Training Method: Fine-tuning with frozen language model backbone
  • Optimizer: AdamW (fused)
  • Learning Rate: 6e-5 with linear decay
  • Precision: bfloat16
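
The frozen-backbone setup above can be sketched roughly as follows. This is an illustration, not the actual training script: the `language_model`/`vision_tower` submodule names are assumptions about the checkpoint layout, and a tiny stand-in model is used so the snippet is self-contained.

```python
import torch
from torch import nn

# toy stand-in for the vision-language model; in the real run this is the
# loaded LightOnOCR model and the submodule names may differ
model = nn.ModuleDict({
    "vision_tower": nn.Linear(4, 4),
    "language_model": nn.Linear(4, 4),
})

# freeze the language-model backbone; only the remaining parameters train
for name, param in model.named_parameters():
    if name.startswith("language_model"):
        param.requires_grad = False

trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=6e-5)  # pass fused=True on CUDA
scheduler = torch.optim.lr_scheduler.LinearLR(
    optimizer, start_factor=1.0, end_factor=0.0, total_iters=1000
)
```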

Limitations

  • This model is trained on line-level images only. For full-page transcription, you need to first segment the page into individual lines.
  • Performance may vary on manuscript styles not represented in the training data.
  • Old Church Slavonic has many abbreviations and special characters that may require domain-specific post-processing.
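
Since the model expects single-line crops, a full page must be segmented first. A minimal sketch of cropping line strips with PIL, assuming you already have line bounding boxes (e.g. from a layout-analysis tool such as Kraken, or from existing ALTO/PAGE annotations):

```python
from PIL import Image

def crop_lines(page, boxes):
    """Crop a page image into line images from (left, top, right, bottom) boxes."""
    return [page.crop(box) for box in boxes]

# blank stand-in for a scanned page; hypothetical line coordinates
page = Image.new("RGB", (1000, 1400), "white")
boxes = [(50, 100, 950, 160), (50, 180, 950, 240)]
lines = crop_lines(page, boxes)
```

Each resulting crop can then be passed to the model exactly as in the single-image example above.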

Citation

If you use this model, please cite:

```bibtex
@misc{lightonocr2_ocs_2026,
  title = {LightOnOCR Fine-tuned for Old Church Slavonic},
  author = {William Mattingly},
  year = {2026},
  howpublished = {\url{https://huggingface.co/wjbmattingly/LightOnOCR-2-1B-old-church-slavonic-line}}
}
```

And the original LightOnOCR paper:

```bibtex
@misc{lightonocr2_2026,
  title = {LightOnOCR: A 1B End-to-End Multilingual Vision-Language Model for State-of-the-Art OCR},
  author = {Said Taghadouini and Adrien Cavaill\`{e}s and Baptiste Aubertin},
  year = {2026},
  howpublished = {\url{https://arxiv.org/pdf/2601.14251}}
}
```

Acknowledgments

  • LightOn AI for the excellent LightOnOCR base model
  • The creators of the Old Church Slavonic dataset