MyNLI: Burmese Natural Language Inference with XLM-RoBERTa
A Burmese Natural Language Inference (NLI) model fine-tuned from xlm-roberta-base, trained on a curated Burmese NLI dataset combining cleaned native data, manual annotations, and translated English NLI samples.
This model predicts the relationship between a premise and a hypothesis as one of:
- Entailment
- Neutral
- Contradiction
Model Details
- Base model:
xlm-roberta-base - Language: Burmese (Myanmar)
- Task: Natural Language Inference (NLI)
- Labels:
entailment,neutral,contradiction - Framework: Transformers / PyTorch
Dataset
The model is trained on an ~8K Burmese NLI dataset, prepared from:
- Cleaned Burmese NLI data (source: [(https://huggingface.co/datasets/akhtet/myanmar-xnli)])
- Additional manually created samples
- Translated English NLI data for diversity
Dataset Structure
Most samples follow a 1 premise β 3 hypotheses structure
Each hypothesis has a different NLI label
An additional
genrefield is included- Intended for future zero-shot / cross-genre experiments
- Not used during training yet
Preprocessing
Since a pretrained multilingual LLM is used, no manual tokenization (word-level or syllable-level) is applied.
Steps:
- Unicode normalization (NFC)
- Zawgyi detection
- Automatic conversion to Unicode if Zawgyi text is detected
- Rely on XLM-R subword tokenizer for tokenization
Data Splitting Strategy
To prevent data leakage caused by shared premises:
- Train: 70%
- Validation: 15%
- Test: 15%
Instead of random shuffling:
GroupShuffleSplit is used
Samples with the same premise always stay in the same split
Prevents:
- Premise overlap across splits
- Hypothesis leakage between train / validation / test sets
Training Setup
- Epochs: 4
- Learning rate:
2e-5 - Batch size: 8
- Weight decay:
0.01 - Warmup ratio:
0.1 - FP16 training
- Best model selected by: Validation F1 score
- Seed: 42
Evaluation Metrics
The model is evaluated using:
- Accuracy
- Macro F1-score
Training Results
| Epoch | Train Loss | Val Loss | Accuracy | F1 |
|---|---|---|---|---|
| 1 | 0.9509 | 0.8948 | 0.5602 | 0.5143 |
| 2 | 0.7850 | 0.6888 | 0.7233 | 0.7153 |
| 3 | 0.6067 | 0.6367 | 0.7660 | 0.7603 |
| 4 | 0.4301 | 0.7060 | 0.7455 | 0.7411 |
Best checkpoint selected based on F1 score.
Test Set Performance
{
"eval_loss": 0.7713,
"eval_accuracy": 0.7868,
"eval_f1": 0.7852
}
Confusion Matrix on Test Set
[[352 57 38]
[ 53 270 34]
[ 27 46 319]]
Rows represent true labels, columns represent predicted labels (Label order: entailment, neutral, contradiction)
Inference Example
You can use the model as follows:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
model_name = "emilyyy04/xlm-roberta-base-burmese-nli"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
premise = "αα°ααααΊ αα±αΈαα―αΆαα½ααΊ ααα¬αααΊα‘ααΌα
αΊ α‘αα―ααΊαα―ααΊαα±αααΊα"
hypothesis = "αα°ααααΊ αα»ααΊαΈαα¬αα±αΈαα―ααΊαααΊαΈαα½ααΊ α‘αα―ααΊαα―ααΊαα±αααΊα"
inputs = tokenizer(
premise,
hypothesis,
return_tensors="pt",
truncation=True,
padding=True
)
outputs = model(**inputs)
predicted_class = torch.argmax(outputs.logits, dim=1).item()
label_map = {0: "entailment", 1: "neutral", 2: "contradiction"}
print("Predicted label:", label_map[predicted_class])
# conf
probs = torch.softmax(outputs.logits, dim=-1)[0]
print("Confidence:", {k: round(float(probs[i]), 3) for i, k in label_map.items()})
Limitations & Future Work
Genre-aware and zero-shot classification is planned but not yet implemented
Performance may vary for:
- Very long inputs
- Out-of-domain or highly informal Burmese
Future improvements:
- Larger native Burmese NLI dataset
- Explicit genre-based evaluation
- Domain adaptation
- Downloads last month
- 10
Model tree for emilyyy04/xlm-roberta-base-burmese-nli
Base model
FacebookAI/xlm-roberta-base