mini-ocr: Khmer & English Text Recognition

A lightweight CRNN (CNN + Bi-LSTM) model trained to recognise Khmer and English text from image crops.
A CTC output head lets it transcribe variable-length text without character-level segmentation.


Model Architecture

Component      Details
CNN backbone   6 × Conv-BN-ReLU blocks with MaxPool
Recurrent      2 × Bi-LSTM (hidden = 256) with a linear bridge
Output         CTC linear → NUM_CHARS + 1 (blank = 0)
Input          Greyscale image, height normalised to 32 px, width variable
Vocabulary     222 characters: lowercase/uppercase Latin, digits, Khmer consonants, vowels, diacritics, punctuation
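
To illustrate why the output layer reserves index 0 for a blank, here is a minimal sketch of greedy CTC collapsing in plain Python. The lowercase-only mapping and the per-frame labels are made up for illustration; they are not outputs of this model.

```python
BLANK = 0  # blank index, as listed in the Output row

# Illustrative 1-based mapping for lowercase Latin only (a=1, b=2, ...).
idx2char = {i + 1: c for i, c in enumerate("abcdefghijklmnopqrstuvwxyz")}

def ctc_collapse(frame_labels):
    """Collapse repeated labels, then drop blanks (greedy CTC decoding)."""
    out, prev = [], None
    for label in frame_labels:
        if label != prev and label != BLANK:
            out.append(idx2char[label])
        prev = label
    return "".join(out)

# Hypothetical per-frame argmax labels for the word "hello".
frames = [0, 8, 8, 0, 5, 12, 12, 0, 12, 15, 15]
print(ctc_collapse(frames))  # hello
```

The blank between the two runs of 12 is what keeps the double "l": without it, the repeated labels would collapse into a single character.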

Files

File                Description
model.pt            state_dict; requires the model class definition to load
model_scripted.pt   TorchScript version; no class definition needed
vocab.txt           One character per line, index = line number (1-based)
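
As a rough sketch of how the 1-based indexing in vocab.txt could be turned into a lookup table, the helper below reads the file line by line. The function name `load_vocab` and the UTF-8 encoding are assumptions, not part of the published files.

```python
def load_vocab(path):
    """Map 1-based line numbers to characters; index 0 stays free for the CTC blank."""
    idx2char = {}
    # Assumption: the file is UTF-8 encoded.
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            # Strip only the line terminator: a bare space may be a valid vocabulary entry.
            idx2char[line_no] = line.rstrip("\r\n")
    return idx2char
```

Stripping with `strip()` instead would silently turn a space entry into an empty string, so only the newline is removed here.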

Quick Start

Install dependencies

pip install torch torchvision pillow numpy huggingface_hub

import torch
import numpy as np
from PIL import Image
from huggingface_hub import hf_hub_download

TOKENS = (
    "abcdefghijklmnopqrstuvwxyz"
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "0123456789"
    "កខគឃងចឆជឈញដឋឌឍណតថទធនបផពភមយរលវឝឞសហឡអឣឤឥឦឧឩឪឫឬឭឮឯឰឱឲឳ"
    "ាិីឹឺុូួើឿៀេែៃោៅំះៈ៉៊់៌៍៎៏័៑្។៕៖ៗ៘៛៝"
    "០១២៣៤៥៦៧៨៩៳"
    "!@#$%^&*()-_=+[]{};:'\",.<>?/|\\ "
)
# Index 0 is reserved for the CTC blank, so characters start at 1.
idx2char = {i + 1: c for i, c in enumerate(TOKENS)}

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

scripted_path = hf_hub_download(repo_id="phonsobon/mini-ocr", filename="model_scripted.pt")
model = torch.jit.load(scripted_path, map_location=device)
model.eval()

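def load_image(path):
    # Greyscale, height fixed at 32 px, width scaled to keep the aspect ratio.
    img = Image.open(path).convert("L")
    w, h = img.size
    # Clamp the width so very tall, narrow crops never resize to 0 px.
    img = img.resize((max(1, int(w / h * 32)), 32))
    img = np.array(img, dtype=np.float32) / 255.0  # scale to [0, 1]
    return torch.from_numpy(img).unsqueeze(0).unsqueeze(0)  # (1, 1, 32, W)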

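def ctc_decode(logits):
    # Greedy CTC decoding. The indexing treats logits as (time, batch, classes)
    # and decodes the first item in the batch.
    preds = torch.argmax(logits, dim=2)[:, 0].tolist()
    prev, text = -1, []
    for p in preds:
        # Collapse repeated labels, then drop blanks (index 0).
        if p != prev and p != 0:
            text.append(idx2char.get(p, ""))
        prev = p
    return "".join(text)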

img = load_image("your_image.png").to(device)
with torch.no_grad():
    result = ctc_decode(model(img))
print("OCR result:", result)

Input Format

  • Single text-line image (word, phrase, or a short line of text)
  • Converted to greyscale internally
  • Height resized to 32 px; width scales proportionally
  • Values normalised to [0, 1]

For full-document OCR, first crop individual text lines, then pass each crop to the model.
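
The resize rule above amounts to simple arithmetic, sketched here without any image library. The helper name `target_size` is hypothetical, and the clamp to a minimum width of 1 px is an added safeguard rather than something the model card specifies.

```python
TARGET_HEIGHT = 32  # model input height in pixels

def target_size(width, height, target_height=TARGET_HEIGHT):
    """Return (new_width, new_height) preserving the aspect ratio."""
    if width <= 0 or height <= 0:
        raise ValueError("image dimensions must be positive")
    # Assumption: clamp so extremely narrow crops never get a width of 0.
    new_width = max(1, int(width / height * target_height))
    return new_width, target_height

print(target_size(200, 64))  # (100, 32)
print(target_size(3, 400))   # (1, 32)
```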


Training Details

Setting                      Value
Epochs                       50
Optimizer                    Adam, lr = 1e-4
Loss                         CTC (blank = 0, zero_infinity = True)
Image height                 32 px
Dataset                      Synthetic; rendered from a vocabulary text file across multiple fonts with noise augmentation (Gaussian, salt-and-pepper, blur, JPEG compression)
Train / Valid / Test split   80 / 10 / 10
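
The model card gives the split ratios but not how the split was produced. One plausible way to get an 80 / 10 / 10 partition is a seeded shuffle, sketched below; the function name `split_dataset` and the fixed seed are assumptions for illustration only.

```python
import random

def split_dataset(samples, ratios=(0.8, 0.1, 0.1), seed=0):
    """Shuffle and split samples into train / valid / test lists."""
    items = list(samples)
    random.Random(seed).shuffle(items)  # seeded for reproducibility
    n_train = int(len(items) * ratios[0])
    n_valid = int(len(items) * ratios[1])
    train = items[:n_train]
    valid = items[n_train:n_train + n_valid]
    test = items[n_train + n_valid:]  # remainder goes to test
    return train, valid, test

train, valid, test = split_dataset(range(1000))
print(len(train), len(valid), len(test))  # 800 100 100
```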

Limitations

  • Designed for single text-line crops, not full documents or paragraphs.
  • Performance may degrade on handwritten text (trained on synthetic rendered images).
  • Very small fonts (< 10 px rendered height) may produce errors.

License

MIT
