SancharExtractor: On-Device Immigration Document Field Extraction

A 3.3M-parameter encoder-only transformer that extracts structured fields from OCR'd immigration documents. It runs entirely on-device via CoreML (iOS 17+), with no cloud dependency.

Train on GPU (Free)

Open In Colab

Or upload Sanchar_Train.ipynb to Google Colab manually:

  1. Go to colab.research.google.com
  2. File → Upload notebook → select Sanchar_Train.ipynb
  3. Runtime → Change runtime type → T4 GPU
  4. Runtime → Run all
  5. Training takes about 25 minutes; the notebook downloads the model files when it finishes

Architecture

  • Model: 4-layer encoder-only transformer, 256 dim, 8 heads
  • Parameters: 3.3M
  • Vocab: 4,096 tokens (immigration domain)
  • Labels: 160 BIO tags covering 79 canonical fields
  • Documents: 24 immigration document types (I-797, I-94, EAD, visa stamps, etc.)
  • Quantized size: 3.2 MB (INT8)
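As a rough sanity check on the 3.3M figure, the parameter count of a standard encoder-only tagger can be worked out by hand. The sketch below takes the vocabulary size, width, depth, and label count from the list above. The feed-forward width (`d_ff=512`) and the number of learned positions (`max_positions=512`) are assumptions made for illustration; they are not stated in this README, and the real model may differ.

```python
def count_parameters(vocab_size=4096, d_model=256, n_layers=4,
                     n_labels=160, d_ff=512, max_positions=512):
    """Count weights and biases of a standard encoder-only transformer tagger.

    d_ff and max_positions are illustrative assumptions, not documented values.
    """
    token_embedding = vocab_size * d_model
    position_embedding = max_positions * d_model  # assumes learned positions
    # Q, K, V and output projections: each d_model x d_model plus a bias.
    attention = 4 * (d_model * d_model + d_model)
    # Two linear layers, d_model -> d_ff -> d_model, with biases.
    feed_forward = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)
    # Two LayerNorms per layer, each with a scale and a shift vector.
    layer_norms = 2 * (2 * d_model)
    per_layer = attention + feed_forward + layer_norms
    # Linear classification head over the BIO tags.
    tag_head = d_model * n_labels + n_labels
    return token_embedding + position_embedding + n_layers * per_layer + tag_head


total = count_parameters()
print(f"{total:,} parameters (~{total / 1e6:.1f}M)")
```

Under these assumptions the total lands close to 3.3M. The number of attention heads does not appear in the formula because splitting the 256-dimensional projections into 8 heads does not change how many weights they hold.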

Quick Validation Results (5K samples)

Metric          Score
Token Accuracy  92.9%
Entity F1       86.5%

Production training on 50K samples on a GPU is expected to reach 93-96% F1.
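The gap between token accuracy and entity F1 is typical for BIO tagging: an entity only counts as correct when its field and both boundaries match exactly, so one wrong token can cost a whole entity. The sketch below shows one common way to compute both metrics. It is not taken from the training notebook, and the tag names (`RECEIPT_NUMBER`, `EXPIRY_DATE`) are hypothetical examples, not entries confirmed to be in bio_tags.json.

```python
def extract_entities(tags):
    """Return a set of (field, start, end) spans from a BIO tag sequence."""
    entities, field, start = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel flushes the last span
        continues = tag.startswith("I-") and tag[2:] == field
        if field is not None and not continues:
            entities.add((field, start, i))
            field = None
        if tag.startswith("B-"):
            field, start = tag[2:], i
    return entities


def token_accuracy(gold, pred):
    """Fraction of positions where the predicted tag equals the gold tag."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def entity_f1(gold, pred):
    """Exact-match F1 over entity spans (field and boundaries must agree)."""
    gold_spans, pred_spans = extract_entities(gold), extract_entities(pred)
    true_pos = len(gold_spans & pred_spans)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(pred_spans)
    recall = true_pos / len(gold_spans)
    return 2 * precision * recall / (precision + recall)


# Hypothetical tag names; the prediction truncates the second entity by one token.
gold = ["B-RECEIPT_NUMBER", "I-RECEIPT_NUMBER", "O", "B-EXPIRY_DATE", "I-EXPIRY_DATE"]
pred = ["B-RECEIPT_NUMBER", "I-RECEIPT_NUMBER", "O", "B-EXPIRY_DATE", "O"]
print(token_accuracy(gold, pred), entity_f1(gold, pred))
```

In this toy case four of five tokens are right, but only one of the two entities matches exactly, so the entity F1 is much lower than the token accuracy.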

Files

├── Sanchar_Train.ipynb          # Complete training notebook (run on Colab)
├── ARCHITECTURE.md              # Full system architecture document
├── data/
│   ├── bio_tags.json            # 160 BIO tag vocabulary
│   ├── form_schemas.json        # 24 document schemas, 355 fields
│   ├── forms_catalog.json       # 97 USCIS forms catalog
│   ├── policy_reference.json    # USCIS policy manual reference
│   ├── edge_cases.json          # 59 immigration edge cases
│   └── reminder_rules.json      # 35 deadline rules + 10 compound rules
├── checkpoints/
│   ├── best/                    # Best validation F1 checkpoint
│   └── final/                   # Final epoch checkpoint
├── tokenizer/
│   ├── sanchar_tokenizer.model  # SentencePiece BPE (4096 vocab)
│   └── sanchar_vocab.json       # Simple tokenizer fallback
└── coreml/
    ├── SancharExtractor_int8.mlpackage  # 3.2 MB, ready for iOS
    └── SancharExtractor_fp32.mlpackage  # 6.2 MB baseline

On-Device Pipeline

Camera → Apple Vision OCR → Document Classifier → SancharExtractor (CoreML) → Structured JSON → Encrypted Vault → Reminder Engine

All PII stays on device. Zero cloud processing.
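To illustrate the "SancharExtractor → Structured JSON" step, here is a minimal Python sketch of how per-token BIO predictions could be merged into a field dictionary. The on-device app would do this in its own code, so this is only an illustration of the idea. The function name `tags_to_fields`, the tag names, and the sample tokens are all made up for the example.

```python
import json


def tags_to_fields(tokens, tags):
    """Merge BIO-tagged OCR tokens into a {field: text} dictionary."""
    fields = {}
    field, words = None, []
    # Sentinel pair at the end flushes a span that runs to the last token.
    for token, tag in zip(list(tokens) + [""], list(tags) + ["O"]):
        continues = tag.startswith("I-") and tag[2:] == field
        if field is not None and not continues:
            # Keep the first span seen for each field (an illustrative policy).
            fields.setdefault(field, " ".join(words))
            field, words = None, []
        if tag.startswith("B-"):
            field, words = tag[2:], [token]
        elif continues:
            words.append(token)
    return fields


# Made-up tokens and hypothetical tag names, for illustration only.
tokens = ["Name", "JANE", "DOE", "Receipt", "EAC2190012345"]
tags = ["O", "B-BENEFICIARY_NAME", "I-BENEFICIARY_NAME", "O", "B-RECEIPT_NUMBER"]
fields = tags_to_fields(tokens, tags)
print(json.dumps(fields, indent=2))
```

Keeping only the first span per field is one possible policy; a real pipeline might instead keep the highest-confidence span or validate values against the document schema.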
