# TrOCR-Printed for Tigrinya OCR

License: MIT | Language: Tigrinya (ti) | Script: Ge'ez | Base: TrOCR-base-printed

*Adapting TrOCR for Printed Tigrinya Text Recognition: Word-Aware Loss Weighting for Cross-Script Transfer Learning*

A fine-tuned TrOCR model for printed Tigrinya line-level text recognition. This is the printed pre-training variant, fine-tuned from microsoft/trocr-base-printed using vocabulary extension and Word-Aware Loss Weighting to resolve word-boundary failures caused by BPE space-marker conventions.


## Model Details

| Field | Value |
|---|---|
| Model name | Yonatanhaile2026/tigrinya-trocr-printed |
| Base model | microsoft/trocr-base-printed |
| Task | Printed Tigrinya OCR (image-to-text) |
| Language | Tigrinya (ti) |
| Script | Ge'ez |
| Model type | VisionEncoderDecoderModel |
| Vocabulary | Extended from 50,265 to 50,495 tokens (230 Ge'ez characters added) |
| Training data | GLOCR Tigrinya News text-line images (synthetic) |
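The vocabulary extension in the table above (50,265 to 50,495 tokens) can be sketched in plain Python. This is a minimal illustration only: it assumes the 230 added entries are single characters drawn from the Unicode Ethiopic block (U+1200 to U+137F), which is an assumption; the exact character list used for this model is not published here.

```python
# Sketch: extending a BPE vocabulary with Ge'ez characters.
# Assumption: the 230 new tokens are single codepoints from the
# Unicode Ethiopic block; the real selection may differ.
BASE_VOCAB_SIZE = 50_265  # original TrOCR-base-printed BPE vocabulary


def extend_vocab(base_size: int, new_chars: list) -> dict:
    """Assign consecutive new token ids, starting at base_size."""
    return {ch: base_size + i for i, ch in enumerate(new_chars)}


# Take the first 230 codepoints of the Ethiopic block as an example.
ethiopic = [chr(cp) for cp in range(0x1200, 0x1380)]
added = extend_vocab(BASE_VOCAB_SIZE, ethiopic[:230])

print(len(added))                    # 230 new entries
print(BASE_VOCAB_SIZE + len(added))  # 50495, the extended size
```

With the `transformers` library, the equivalent step would be `tokenizer.add_tokens(new_chars)` followed by resizing the decoder embedding matrix (e.g. `model.decoder.resize_token_embeddings(len(tokenizer))`), so the new rows are randomly initialized and learned during fine-tuning.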

## Performance

Evaluated on a held-out test set of 5,000 synthetic Tigrinya text-line images.

| Metric | Value |
|---|---|
| Character Error Rate (CER) | 0.22% |
| Word Error Rate (WER) | 0.87% |
| Exact Match Accuracy | 97.20% |

### Bootstrap 95% Confidence Intervals (1,000 iterations, TrOCR-Printed)

| Metric | Point Estimate | 95% CI |
|---|---|---|
| CER | 0.20% | [0.17%, 0.24%] |
| WER | 0.76% | [0.64%, 0.90%] |
| Accuracy | 97.44% | [97.02%, 97.84%] |
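The intervals above come from bootstrap resampling. A minimal percentile-bootstrap sketch of the 1,000-iteration procedure, run here over hypothetical per-sample CER values rather than the real test-set scores:

```python
import random


def bootstrap_ci(values, iters=1_000, alpha=0.05, seed=42):
    """Percentile bootstrap: resample with replacement, collect the
    mean of each resample, and take the alpha/2 and 1-alpha/2 quantiles."""
    rng = random.Random(seed)
    n = len(values)
    means = sorted(sum(rng.choices(values, k=n)) / n for _ in range(iters))
    lo = means[int((alpha / 2) * iters)]
    hi = means[int((1 - alpha / 2) * iters) - 1]
    return lo, hi


# Hypothetical per-sample CERs (fractions), sized like the 5,000-line
# test set, with an overall mean CER of 0.2%.
cers = [0.0] * 4_900 + [0.1] * 100
low, high = bootstrap_ci(cers)
print(f"95% CI: [{low:.2%}, {high:.2%}]")
```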

### Comparison (same dataset and split)

| Model | CER | WER | Accuracy |
|---|---|---|---|
| TrOCR-Handwritten (fine-tuned) | 0.38% | 1.15% | 96.86% |
| TrOCR-Printed (fine-tuned) | 0.22% | 0.87% | 97.20% |
| CRNN-CTC Baseline | 0.12% | 0.57% | 98.20% |
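The CER and WER figures in these tables are edit-distance metrics. A self-contained sketch of CER (Levenshtein distance over characters, normalized by reference length); WER is the same computation over whitespace-split words:

```python
def levenshtein(ref, hyp):
    """Edit distance with unit cost for insert/delete/substitute."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    return levenshtein(ref, hyp) / max(len(ref), 1)


def wer(ref: str, hyp: str) -> float:
    r, h = ref.split(), hyp.split()
    return levenshtein(r, h) / max(len(r), 1)


# Toy example with Ge'ez text (not drawn from the test set):
print(cer("ሰላም ዓለም", "ሰላም ዓለም"))  # 0.0 for an exact match
```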

## Training Details

| Hyperparameter | Value |
|---|---|
| Optimizer | AdamW |
| Learning rate | 4e-5 |
| LR scheduler | Linear decay (no warmup) |
| Epochs | 10 |
| Per-device batch size | 2 |
| Gradient accumulation steps | 4 |
| Effective batch size | 8 |
| Mixed precision | FP16 |
| Boundary loss weight | 2.0 |
| Random seed | 42 |
| Training duration | ~2h 40m |
| Hardware | NVIDIA RTX 5060 Laptop (8 GB GDDR7) |
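The boundary loss weight of 2.0 corresponds to the Word-Aware Loss Weighting named in the title: token-level cross-entropy is up-weighted at word-boundary positions. A minimal sketch of the weight construction, assuming the RoBERTa-style BPE convention in which word-initial tokens carry a leading "Ġ" space marker; the paper's exact boundary definition may differ:

```python
BOUNDARY_WEIGHT = 2.0  # value from the training table above


def token_loss_weights(tokens, marker="\u0120"):  # "\u0120" is "Ġ"
    """Per-token loss weights: word-initial tokens (those starting with
    the BPE space marker) get BOUNDARY_WEIGHT, continuations get 1.0."""
    return [BOUNDARY_WEIGHT if t.startswith(marker) else 1.0
            for t in tokens]


# Hypothetical BPE segmentation of a two-word Tigrinya line:
tokens = ["\u0120ሰላ", "ም", "\u0120ዓለ", "ም"]
print(token_loss_weights(tokens))  # [2.0, 1.0, 2.0, 1.0]
```

In training, these weights would multiply the per-token cross-entropy terms before averaging (e.g. computing the loss with `reduction="none"` and taking a weighted mean), so mistakes at word boundaries are penalized twice as heavily.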

## How to Use

```python
from transformers import VisionEncoderDecoderModel, TrOCRProcessor
from PIL import Image

processor = TrOCRProcessor.from_pretrained("Yonatanhaile2026/tigrinya-trocrprinted")
model = VisionEncoderDecoderModel.from_pretrained("Yonatanhaile2026/tigrinya-trocrprinted")

# Load your text-line image
image = Image.open("your_tigrinya_text_line.png").convert("RGB")

pixel_values = processor(images=image, return_tensors="pt").pixel_values
generated_ids = model.generate(pixel_values, num_beams=5, max_length=128)
prediction = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(prediction)
```

## Intended Use

Suitable for:

  • Printed Tigrinya OCR on synthetic or clean text-line images
  • Benchmarking Transformer-based OCR for low-resource scripts
  • Research on cross-script transfer learning and BPE tokenizer adaptation

Not suitable for:

  • Handwritten Tigrinya OCR
  • Real scanned documents with degradation, skew, or complex layout
  • General-purpose multilingual OCR without task-specific validation

## Limitations

  • Trained and evaluated exclusively on synthetic printed data from a single domain (newspaper text lines)
  • Performance on real-world scanned documents is not validated
  • Results reflect a single training run on one hardware configuration
  • The error analysis relies on automatic Unicode-based heuristics

## Citation

If you use this model, please cite the associated paper and repository:

```bibtex
@misc{medhanie2025tigrinya,
author = {Yonatan Haile Medhanie and Yuanhua Ni},
title = {Adapting TrOCR for Printed Tigrinya Text Recognition: Word-Aware Loss Weighting for Cross-Script Transfer Learning},
year = {2025},
url = {https://github.com/YoHa2024NKU/Tigrinya_TrOCR_Printed}
}
```

## License

MIT License
