SancharExtractor: On-Device Immigration Document Field Extraction
A 3.3M-parameter encoder-only transformer that extracts structured fields from OCR'd immigration documents. Runs entirely on-device via CoreML (iOS 17+) with zero cloud dependency.
Train on GPU (Free)
Upload Sanchar_Train.ipynb to Google Colab manually:
- Go to colab.research.google.com
- File → Upload notebook → select Sanchar_Train.ipynb
- Runtime → Change runtime type → T4 GPU
- Runtime → Run all
- Training takes about 25 minutes; the model files download when it finishes
Architecture
- Model: 4-layer encoder-only transformer, 256 dim, 8 heads
- Parameters: 3.3M
- Vocab: 4,096 tokens (immigration domain)
- Labels: 160 BIO tags covering 79 canonical fields
- Documents: 24 immigration document types (I-797, I-94, EAD, visa stamps, etc.)
- Quantized size: 3.2 MB (INT8)
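As a sanity check on the figures above, the parameter count of a standard encoder-only token classifier can be estimated from its dimensions. The sketch below is illustrative only: the feed-forward width (`d_ff=512`) and the learned position table (`max_len=512`) are assumptions that do not appear in this README, chosen because they land close to the stated 3.3M. The number of heads does not affect the count, since the heads split the same 256-dimensional projections.

```python
def count_parameters(vocab_size=4096, d_model=256, n_layers=4,
                     n_labels=160, d_ff=512, max_len=512):
    """Rough parameter count for an encoder-only token classifier.

    d_ff and max_len are ASSUMED values, not taken from this README.
    """
    token_embedding = vocab_size * d_model
    position_embedding = max_len * d_model  # assumes learned positions

    # Q, K, V and output projections, each with a bias vector.
    attention = 4 * (d_model * d_model + d_model)
    # Two linear layers: d_model -> d_ff -> d_model, with biases.
    feed_forward = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)
    # Two LayerNorms per layer, each with a scale and a shift vector.
    layer_norms = 2 * 2 * d_model
    per_layer = attention + feed_forward + layer_norms

    # Linear classification head over the BIO tag set.
    classifier = d_model * n_labels + n_labels

    return (token_embedding + position_embedding
            + n_layers * per_layer + classifier)


total = count_parameters()
print(f"{total:,} parameters (~{total / 1e6:.1f}M)")
# prints: 3,329,184 parameters (~3.3M)
```

Under these assumptions the token embedding alone accounts for roughly a third of the weights, which is why the small 4,096-token vocabulary matters for the on-device footprint.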
Quick Validation Results (5K samples)
| Metric | Score |
|---|---|
| Token Accuracy | 92.9% |
| Entity F1 | 86.5% |
Production training on 50K samples with a GPU is expected to reach 93-96% Entity F1.
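Entity F1 is typically lower than token accuracy because an entity counts as correct only when its whole span and label match. The following is a minimal sketch of that exact-match convention, not the notebook's actual evaluation code; the tag names `RECEIPT_NUMBER` and `EXPIRY_DATE` are hypothetical examples.

```python
def extract_spans(tags):
    """Return the set of (label, start, end) spans in a BIO tag sequence.

    A stray I- tag with no matching B- before it is ignored.
    """
    spans, label, start = set(), None, None
    # The trailing "O" is a sentinel that flushes a span ending at the last tag.
    for i, tag in enumerate(list(tags) + ["O"]):
        continues = tag.startswith("I-") and tag[2:] == label
        if label is not None and not continues:
            spans.add((label, start, i))
            label = None
        if tag.startswith("B-"):
            label, start = tag[2:], i
    return spans


def entity_f1(gold_sequences, predicted_sequences):
    """Micro-averaged F1 over exactly matching entity spans."""
    tp = fp = fn = 0
    for gold, pred in zip(gold_sequences, predicted_sequences):
        g, p = extract_spans(gold), extract_spans(pred)
        tp += len(g & p)
        fp += len(p - g)
        fn += len(g - p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


gold = [["B-RECEIPT_NUMBER", "I-RECEIPT_NUMBER", "O", "B-EXPIRY_DATE"]]
pred = [["B-RECEIPT_NUMBER", "I-RECEIPT_NUMBER", "O", "O"]]
f1 = entity_f1(gold, pred)
print(f"{f1:.3f}")  # prints 0.667
```

In this toy case three of four tokens are tagged correctly (75% token accuracy), but one of the two entities is missed, so recall is 0.5 and F1 is 0.667.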
Files
├── Sanchar_Train.ipynb # Complete training notebook (run on Colab)
├── ARCHITECTURE.md # Full system architecture document
├── data/
│   ├── bio_tags.json # 160 BIO tag vocabulary
│   ├── form_schemas.json # 24 document schemas, 355 fields
│   ├── forms_catalog.json # 97 USCIS forms catalog
│   ├── policy_reference.json # USCIS policy manual reference
│   ├── edge_cases.json # 59 immigration edge cases
│   └── reminder_rules.json # 35 deadline rules + 10 compound rules
├── checkpoints/
│   ├── best/ # Best validation F1 checkpoint
│   └── final/ # Final epoch checkpoint
├── tokenizer/
│   ├── sanchar_tokenizer.model # SentencePiece BPE (4096 vocab)
│   └── sanchar_vocab.json # Simple tokenizer fallback
└── coreml/
    ├── SancharExtractor_int8.mlpackage # 3.2 MB, ready for iOS
    └── SancharExtractor_fp32.mlpackage # 6.2 MB baseline
On-Device Pipeline
Camera → Apple Vision OCR → Document Classifier → SancharExtractor (CoreML) → Structured JSON → Encrypted Vault → Reminder Engine
All PII stays on device. Zero cloud processing.
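The "SancharExtractor → Structured JSON" step amounts to grouping per-token BIO tags into field values. The sketch below shows one way that grouping could work, written in Python for readability rather than as the on-device implementation. It assumes word-level tokens joined by spaces (the real tokenizer is SentencePiece, so subword pieces would need merging first), and the field names and sample values are made up for illustration.

```python
import json


def tags_to_fields(tokens, tags):
    """Group BIO-tagged tokens into a {field_name: value} dict.

    If a field appears more than once, the first occurrence is kept.
    """
    fields, name, words = {}, None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("I-") and tag[2:] == name:
            words.append(token)  # continue the current field
            continue
        if name is not None:
            # The current field just ended; store it.
            fields.setdefault(name, " ".join(words))
            name, words = None, []
        if tag.startswith("B-"):
            name, words = tag[2:], [token]  # start a new field
    if name is not None:
        fields.setdefault(name, " ".join(words))  # flush the last field
    return fields


# Hypothetical OCR tokens and predicted tags.
tokens = ["Beneficiary", "JANE", "DOE", "Receipt", "EAC2190012345"]
tags = ["O", "B-BENEFICIARY_NAME", "I-BENEFICIARY_NAME", "O", "B-RECEIPT_NUMBER"]

fields = tags_to_fields(tokens, tags)
print(json.dumps(fields, indent=2))
```

Keeping the first occurrence of a repeated field is an arbitrary choice for this sketch; a production pipeline might instead keep the highest-confidence span.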