mini-ocr: Khmer & English Text Recognition
A lightweight CRNN (CNN + Bi-LSTM) model trained to recognise Khmer and English text from image crops.
It uses a CTC head so it can handle variable-length text without needing segmentation.
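As a minimal illustration of why no segmentation is needed, the sketch below collapses a per-frame label sequence the way greedy CTC decoding does: consecutive repeats are merged and blanks are dropped. The four-letter `toy_vocab` and the `frames` sequence are made up for this example and are not the model's real vocabulary or output.

```python
def ctc_collapse(frame_labels, blank=0):
    """Merge consecutive repeated labels, then drop blanks (greedy CTC rule)."""
    out, prev = [], blank
    for label in frame_labels:
        if label != blank and label != prev:
            out.append(label)
        prev = label
    return out


# Hypothetical toy vocabulary; index 0 is reserved for the blank.
toy_vocab = {1: "c", 2: "a", 3: "t", 4: "o"}

# One label per time step: c c - o o - o t t  ("-" is the blank)
frames = [0, 1, 1, 0, 4, 4, 0, 4, 3, 3]
labels = ctc_collapse(frames)
decoded = "".join(toy_vocab[i] for i in labels)
print(decoded)  # coot
```

The blank between the two runs of `o` is what lets a doubled letter survive the merge step; without it the two runs would collapse into one.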
Model Architecture
| Component | Details |
|---|---|
| CNN backbone | 6 × Conv-BN-ReLU blocks with MaxPool |
| Recurrent | 2 × Bi-LSTM (hidden = 256) with a linear bridge |
| Output | CTC linear → NUM_CHARS + 1 (blank = 0) |
| Input | Greyscale image, height normalised to 32 px, width variable |
| Vocabulary | 222 characters: lowercase/uppercase Latin, digits, Khmer consonants, vowels, diacritics, punctuation |
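The table can be read as roughly the following PyTorch module. This is a sketch under stated assumptions, not the released class: the channel widths, the placement of the pooling layers, the position of the linear bridge between the two Bi-LSTMs, and the name `CRNNSketch` are all guesses, so the weights in `model.pt` are not expected to load into it.

```python
import torch
from torch import nn

NUM_CHARS = 222  # vocabulary size from the table; index 0 is the CTC blank


def conv_block(c_in, c_out, pool=None):
    """Conv-BN-ReLU, optionally followed by MaxPool (pool sizes are assumptions)."""
    layers = [
        nn.Conv2d(c_in, c_out, kernel_size=3, padding=1),
        nn.BatchNorm2d(c_out),
        nn.ReLU(inplace=True),
    ]
    if pool is not None:
        layers.append(nn.MaxPool2d(pool))
    return nn.Sequential(*layers)


class CRNNSketch(nn.Module):
    def __init__(self, num_chars=NUM_CHARS, hidden=256):
        super().__init__()
        # Six blocks; the pooling reduces height 32 -> 1 and width W -> W / 4.
        self.cnn = nn.Sequential(
            conv_block(1, 64, pool=(2, 2)),
            conv_block(64, 128, pool=(2, 2)),
            conv_block(128, 256),
            conv_block(256, 256, pool=(2, 1)),
            conv_block(256, 512),
            conv_block(512, 512, pool=(4, 1)),
        )
        self.rnn1 = nn.LSTM(512, hidden, bidirectional=True)
        self.bridge = nn.Linear(2 * hidden, hidden)  # assumed to sit between the LSTMs
        self.rnn2 = nn.LSTM(hidden, hidden, bidirectional=True)
        self.head = nn.Linear(2 * hidden, num_chars + 1)

    def forward(self, x):                    # x: (B, 1, 32, W)
        f = self.cnn(x)                      # (B, 512, 1, W / 4)
        f = f.squeeze(2).permute(2, 0, 1)    # (T, B, 512), one step per feature column
        f, _ = self.rnn1(f)
        f = self.bridge(f)
        f, _ = self.rnn2(f)
        return self.head(f)                  # (T, B, num_chars + 1) logits


sketch = CRNNSketch().eval()
with torch.no_grad():
    out = sketch(torch.zeros(1, 1, 32, 100))
print(out.shape)  # torch.Size([25, 1, 223])
```

With this pooling layout a 100 px wide crop yields 25 time steps, so the output sequence length grows with the image width.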
Files
| File | Description |
|---|---|
| `model.pt` | `state_dict`; loading it requires the model class definition |
| `model_scripted.pt` | TorchScript version; no class definition needed |
| `vocab.txt` | One character per line, index = line number (1-based) |
Quick Start
Install dependencies
```bash
pip install torch torchvision pillow numpy huggingface_hub
```
```python
import numpy as np
import torch
from huggingface_hub import hf_hub_download
from PIL import Image

REPO_ID = "phonsobon/mini-ocr"

# Build the index-to-character map from vocab.txt: one character per line,
# index = 1-based line number (index 0 is the CTC blank). This assumes the
# file lists the characters in the same order that was used for training.
vocab_path = hf_hub_download(repo_id=REPO_ID, filename="vocab.txt")
with open(vocab_path, encoding="utf-8") as f:
    idx2char = {i + 1: line.rstrip("\r\n") for i, line in enumerate(f)}

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# The TorchScript export needs no Python class definition.
scripted_path = hf_hub_download(repo_id=REPO_ID, filename="model_scripted.pt")
model = torch.jit.load(scripted_path, map_location=device)
model.eval()


def load_image(path):
    """Load an image as a (1, 1, 32, W) float tensor with values in [0, 1]."""
    img = Image.open(path).convert("L")
    w, h = img.size
    new_w = max(1, int(w / h * 32))  # keep the aspect ratio, never a zero width
    img = img.resize((new_w, 32))
    arr = np.asarray(img, dtype=np.float32) / 255.0
    return torch.from_numpy(arr).unsqueeze(0).unsqueeze(0)


def ctc_decode(logits):
    """Greedy CTC decoding: best class per time step, merge repeats, drop blanks."""
    # The indexing treats logits as (time, batch, classes) and reads batch item 0.
    preds = torch.argmax(logits, dim=2)[:, 0].tolist()
    prev, text = -1, []
    for p in preds:
        if p != prev and p != 0:
            text.append(idx2char.get(p, ""))
        prev = p
    return "".join(text)


img = load_image("your_image.png").to(device)
with torch.no_grad():
    result = ctc_decode(model(img))
print("OCR result:", result)
```
Input Format
- Single text-line image (word, phrase, or a short line of text)
- Converted to greyscale internally
- Height resized to 32 px; width scales proportionally
- Values normalised to [0, 1]
For full-document OCR, first crop individual text lines, then pass each crop to the model.
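One simple way to obtain line crops is a horizontal projection profile: mark every pixel row that contains dark pixels and cut the page at the empty rows between runs. The sketch below is an illustration only, not part of this model or its training pipeline; `split_lines`, `ink_threshold`, and `min_height` are hypothetical names, and it assumes dark text on a light, deskewed page. Real scans, and Khmer text with tall stacked diacritics, will likely need tuning or a proper text-line detector.

```python
import numpy as np


def split_lines(page, ink_threshold=0.5, min_height=2):
    """Return (top, bottom) row ranges of text lines in a greyscale page.

    page: 2-D float array in [0, 1], dark text on a light background.
    """
    ink_rows = (page < ink_threshold).any(axis=1)  # True where a row has dark pixels
    lines, start = [], None
    for y, has_ink in enumerate(ink_rows):
        if has_ink and start is None:
            start = y                              # a new line begins
        elif not has_ink and start is not None:
            if y - start >= min_height:            # ignore specks thinner than min_height
                lines.append((start, y))
            start = None
    if start is not None and len(ink_rows) - start >= min_height:
        lines.append((start, len(ink_rows)))       # line touching the bottom edge
    return lines


# Synthetic page with two dark bands standing in for two lines of text.
page = np.ones((60, 100), dtype=np.float32)
page[5:15, 10:90] = 0.0
page[30:42, 10:70] = 0.0

line_ranges = split_lines(page)
print(line_ranges)  # [(5, 15), (30, 42)]
crops = [page[top:bottom] for top, bottom in line_ranges]
```

Each entry in `crops` can then be resized to a height of 32 px and passed to the model like any other single-line image.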
Training Details
| Setting | Value |
|---|---|
| Epochs | 50 |
| Optimizer | Adam, lr = 1e-4 |
| Loss | CTC (blank = 0, zero_infinity = True) |
| Image height | 32 px |
| Dataset | Synthetic: rendered from a vocabulary text file across multiple fonts with noise augmentation (Gaussian, salt-and-pepper, blur, JPEG compression) |
| Train / Valid / Test split | 80 / 10 / 10 |
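The optimizer and loss rows correspond to a training step along the following lines. This is a sketch with dummy data, not the original training script: the `nn.Linear` stand-in model, the feature size of 64, the batch size, and the sequence lengths are placeholders chosen only so that the example runs on its own.

```python
import torch
from torch import nn

NUM_CLASSES = 222 + 1   # vocabulary plus the CTC blank at index 0
T, B, L = 25, 4, 10     # time steps, batch size, label length (all arbitrary)

torch.manual_seed(0)

# Placeholder for the CRNN: any module that yields (T, B, NUM_CLASSES) logits.
model = nn.Linear(64, NUM_CLASSES)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.CTCLoss(blank=0, zero_infinity=True)

features = torch.randn(T, B, 64)                    # dummy encoder features
targets = torch.randint(1, NUM_CLASSES, (B, L))     # labels 1..222, never the blank
input_lengths = torch.full((B,), T, dtype=torch.long)
target_lengths = torch.full((B,), L, dtype=torch.long)

# nn.CTCLoss expects log-probabilities shaped (T, B, C).
log_probs = model(features).log_softmax(dim=2)
loss = criterion(log_probs, targets, input_lengths, target_lengths)

optimizer.zero_grad()
loss.backward()
optimizer.step()
print(float(loss))
```

`zero_infinity=True` replaces infinite losses, which arise when a target cannot be aligned to the available time steps, with zero so that a single such sample does not produce non-finite gradients.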
Limitations
- Designed for single text-line crops, not full documents or paragraphs.
- Performance may degrade on handwritten text (trained on synthetic rendered images).
- Very small fonts (< 10 px rendered height) may produce errors.
License
MIT