mini-ocr — Khmer & English Text Recognition

A lightweight CRNN (CNN + Bi-LSTM) model trained to recognise Khmer and English text from image crops.
It uses a CTC head, so it handles variable-length text without character-level segmentation.

Model Architecture

Component	Details
CNN backbone	6 × Conv-BN-ReLU blocks with MaxPool
Recurrent	2 × Bi-LSTM (hidden = 256) with a linear bridge
Output	CTC linear → `NUM_CHARS + 1` (blank = 0)
Input	Greyscale image, height normalised to 32 px, width variable
Vocabulary	222 characters — lowercase/uppercase Latin, digits, Khmer consonants, vowels, diacritics, punctuation

Files

File	Description
`model.pt`	`state_dict` — load with the class definition below
`model_scripted.pt`	TorchScript version — no class definition needed
`vocab.txt`	One character per line, index = line number (1-based)
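
As a minimal sketch of how `vocab.txt` could be turned into an index-to-character table, assuming the file is UTF-8 and that index 0 stays reserved for the CTC blank: `load_vocab` is a hypothetical helper name, not something shipped with the model. Only the line terminator is stripped, because the token list in the Quick Start includes a space character, which a plain `strip()` would erase.

```python
import io

def load_vocab(lines):
    """Map 1-based line numbers to characters; index 0 is left for the CTC blank."""
    idx2char = {}
    for line_no, line in enumerate(lines, start=1):
        # Strip only the line terminator: a bare space can be a valid token.
        idx2char[line_no] = line.rstrip("\r\n")
    return idx2char

# Stand-in for open("vocab.txt", encoding="utf-8"); the contents are illustrative.
sample = io.StringIO("a\nb\nក\n \n")
vocab = load_vocab(sample)
print(vocab)
```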

Quick Start

Install dependencies

pip install torch torchvision pillow numpy huggingface_hub

import torch
import numpy as np
from PIL import Image
from huggingface_hub import hf_hub_download

TOKENS = (
    "abcdefghijklmnopqrstuvwxyz"
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "0123456789"
    "កខគឃងចឆជឈញដឋឌឍណតថទធនបផពភមយរលវឝឞសហឡអឣឤឥឦឧឩឪឫឬឭឮឯឰឱឲឳ"
    "ាិីឹឺុូួើឿៀេែៃោៅំះៈ៉៊់៌៍៎៏័៑្។៕៖ៗ៘៛៝"
    "០១២៣៤៥៦៧៨៩៳"
    "!@#$%^&*()-_=+[]{};:'\",.<>?/|\\ "
)
idx2char = {i + 1: c for i, c in enumerate(TOKENS)}

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

scripted_path = hf_hub_download(repo_id="phonsobon/mini-ocr", filename="model_scripted.pt")
model = torch.jit.load(scripted_path, map_location=device)
model.eval()

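def load_image(path):
    # Greyscale, height fixed at 32 px, width scaled to keep the aspect ratio.
    img = Image.open(path).convert("L")
    w, h = img.size
    new_w = max(1, int(w / h * 32))  # guard against a zero-width resize on very narrow crops
    img = img.resize((new_w, 32))
    img = np.array(img, dtype=np.float32) / 255.0
    # Add batch and channel dimensions: (1, 1, 32, W).
    return torch.tensor(img).unsqueeze(0).unsqueeze(0)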

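def ctc_decode(logits):
    # Greedy CTC decoding; the indexing assumes logits shaped (T, batch, classes)
    # and decodes only the first sample in the batch.
    preds = torch.argmax(logits, dim=2)[:, 0].tolist()
    prev, text = -1, []
    for p in preds:
        # Skip repeated indices and the blank (index 0).
        if p != prev and p != 0:
            text.append(idx2char.get(p, ""))
        prev = p
    return "".join(text)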

img = load_image("your_image.png").to(device)
with torch.no_grad():
    result = ctc_decode(model(img))
print("OCR result:", result)
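
To make the collapse rule in `ctc_decode` concrete, here is a small framework-free sketch of greedy CTC decoding over a list of per-timestep class indices. The `collapse` helper and the three-character `toy_vocab` are illustrative assumptions, not part of the model.

```python
def collapse(indices, idx2char, blank=0):
    """Greedy CTC collapse: merge consecutive repeats, then drop blanks."""
    out, prev = [], None
    for i in indices:
        if i != prev and i != blank:
            out.append(idx2char.get(i, ""))
        prev = i
    return "".join(out)

toy_vocab = {1: "c", 2: "a", 3: "t"}
# Repeated frames merge into one character.
print(collapse([1, 1, 0, 2, 2, 2, 3, 0], toy_vocab))
# A blank between two identical indices keeps both characters.
print(collapse([2, 0, 2], toy_vocab))
```

The second call shows why the blank class matters: without it, doubled letters could not be distinguished from one character spanning several frames.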

Input Format

Single text-line image (word, phrase, or a short line of text)
Converted to greyscale internally
Height resized to 32 px; width scales proportionally
Values normalised to [0, 1]
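
The resize rule above amounts to a single proportional scaling. A small sketch, where `target_width` is a hypothetical helper and the clamp to at least 1 px is an added safeguard rather than something the model is known to require:

```python
def target_width(width, height, target_height=32):
    """Width that preserves the aspect ratio when height is resized to target_height."""
    if width <= 0 or height <= 0:
        raise ValueError("image dimensions must be positive")
    return max(1, int(width / height * target_height))

print(target_width(200, 64))  # a 64 px tall crop is halved
print(target_width(300, 16))  # a 16 px tall crop is doubled
```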

For full-document OCR, first crop individual text lines, then pass each crop to the model.
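
One simple way to obtain line crops is a horizontal projection profile: rows that contain ink are grouped into bands, and each band becomes one crop. The sketch below is an assumption about preprocessing, not part of this model, and it works on a binarised page given as nested lists (1 = ink). Real pages usually need deskewing and a noise threshold first; `find_line_bands` and `min_ink` are hypothetical names.

```python
def find_line_bands(binary_rows, min_ink=1):
    """Return (top, bottom) row ranges, bottom exclusive, for each run of inked rows."""
    bands, start = [], None
    for y, row in enumerate(binary_rows):
        has_ink = sum(row) >= min_ink
        if has_ink and start is None:
            start = y                     # a text line begins
        elif not has_ink and start is not None:
            bands.append((start, y))      # the text line ended on the previous row
            start = None
    if start is not None:                 # page ends while still inside a line
        bands.append((start, len(binary_rows)))
    return bands

page = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 1, 1],
]
print(find_line_bands(page))
```

Each `(top, bottom)` band can then be cropped from the original image (for example with Pillow's `Image.crop`) and given the same preprocessing as `load_image` before inference.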

Training Details

Setting	Value
Epochs	50
Optimizer	Adam, lr = 1e-4
Loss	CTC (`blank = 0`, `zero_infinity = True`)
Image height	32 px
Dataset	Synthetic — rendered from a vocabulary text file across multiple fonts with noise augmentation (Gaussian, salt-and-pepper, blur, JPEG compression)
Train / Valid / Test split	80 / 10 / 10
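
The 80 / 10 / 10 split could be reproduced along these lines. This is a sketch under assumptions: the table does not say how samples were shuffled or seeded, so `split_dataset` and the fixed seed are illustrative only.

```python
import random

def split_dataset(items, ratios=(0.8, 0.1, 0.1), seed=0):
    """Shuffle deterministically, then cut into train / valid / test lists."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_train = int(len(items) * ratios[0])
    n_valid = int(len(items) * ratios[1])
    train = items[:n_train]
    valid = items[n_train:n_train + n_valid]
    test = items[n_train + n_valid:]  # remainder, absorbs any rounding
    return train, valid, test

train, valid, test = split_dataset(range(1000))
print(len(train), len(valid), len(test))
```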

Limitations

Designed for single text-line crops, not full documents or paragraphs.
Performance may degrade on handwritten text (trained on synthetic rendered images).
Very small fonts (< 10 px rendered height) may produce errors.

License

MIT
