	---
	library_name: transformers
	license: apache-2.0
	datasets:
	- kalixlouiis/raw-data
	language:
	- my
	new_version: DatarrX/myX-Tokenizer
	pipeline_tag: feature-extraction
	---
	# DatarrX / myX-Tokenizer-BPE ⚙️

	myX-Tokenizer-BPE is a Byte Pair Encoding (BPE) tokenizer trained specifically for the Burmese language. Developed by [Khant Sint Heinn (Kalix Louis)](https://huggingface.co/kalixlouiis) under [DatarrX (Myanmar Open Source NGO)](https://huggingface.co/DatarrX), it serves as a BPE baseline for Burmese NLP tasks.

	## 🎯 Objectives & Characteristics

	* BPE Baseline: Provides a standard BPE segmentation of Burmese text.
	* Burmese Focus: Trained exclusively on Burmese text, so it is specialized for Burmese script.
	* Memory Efficiency: Trained on a 1.5-million-sentence corpus using a RAM-efficient approach.
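
For readers new to BPE, the core idea can be sketched in a few lines of plain Python: start from single characters and repeatedly merge the most frequent adjacent pair. This is a toy illustration only, not the SentencePiece implementation used to train this tokenizer; the function names `learn_bpe_merges`, `merge_pair`, and `segment` are hypothetical, and ties are broken arbitrarily (by pair order) for determinism.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def learn_bpe_merges(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (toy version)."""
    # Every word starts as a tuple of single characters, weighted by its frequency.
    vocab = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count each adjacent symbol pair across the whole corpus.
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        # Most frequent pair wins; ties fall back to pair order (an assumption).
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        # Apply the new merge rule to every word in the corpus.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def segment(word, merges):
    """Segment a word by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


merges = learn_bpe_merges(["low", "low", "lower", "lowest"], num_merges=2)
print(merges)                    # [('o', 'w'), ('l', 'ow')]
print(segment("lowly", merges))  # ['low', 'l', 'y']
```

The same procedure applies to Burmese text, where the starting symbols are Unicode characters of the Myanmar script rather than Latin letters.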

	## 🛠️ Technical Specifications

	* Algorithm: Byte Pair Encoding (BPE).
	* Vocabulary Size: 64,000.
	* Normalization: NFKC.
	* Features: Byte-fallback, Split Digits, and Dummy Prefix.
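
The following sketch illustrates, in plain Python, what the listed options mean conceptually. It does not call SentencePiece and is not the tokenizer's actual code path: the helper names `preprocess`, `split_digits`, and `byte_fallback` are hypothetical, and the digit test uses Python's `str.isdigit()`, which may not match SentencePiece's exact rule.

```python
import unicodedata

SPACE_MARK = "\u2581"  # the visible whitespace marker used in SentencePiece pieces


def preprocess(text):
    """NFKC-normalize, then add a dummy prefix and mark every space."""
    normalized = unicodedata.normalize("NFKC", text)
    # The dummy prefix makes a word look the same at the start of a sentence
    # as it does after a space.
    return SPACE_MARK + normalized.replace(" ", SPACE_MARK)


def split_digits(piece):
    """Split a piece so that every digit becomes its own piece."""
    out, buf = [], ""
    for ch in piece:
        if ch.isdigit():  # assumption: Python's notion of a digit
            if buf:
                out.append(buf)
                buf = ""
            out.append(ch)
        else:
            buf += ch
    if buf:
        out.append(buf)
    return out


def byte_fallback(ch):
    """Represent an out-of-vocabulary character as UTF-8 byte pieces."""
    return [f"<0x{b:02X}>" for b in ch.encode("utf-8")]


text = preprocess("ＡＢ １２")   # full-width input is folded by NFKC
print(text)                    # ▁AB▁12
print(split_digits(text))      # ['▁AB▁', '1', '2']
print(byte_fallback("€"))      # ['<0xE2>', '<0x82>', '<0xAC>']
```

Byte-fallback means a character missing from the 64,000-entry vocabulary is still encodable as a sequence of byte pieces rather than collapsing to an unknown token.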

	### Training Data
	Trained on 1.5 million Burmese-only sentences drawn from [kalixlouiis/raw-data](https://huggingface.co/datasets/kalixlouiis/raw-data).
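
The exact filtering used to build the Burmese-only corpus is not described here. As a rough sketch of one way such a filter could work, the snippet below keeps only sentences whose characters all fall in the main Myanmar Unicode block (U+1000 to U+109F), and streams them lazily so the full corpus never has to sit in memory. The function names and the block-based rule are assumptions for illustration.

```python
def is_burmese_char(ch):
    """True if `ch` lies in the main Myanmar Unicode block (U+1000 to U+109F)."""
    return "\u1000" <= ch <= "\u109f"


def is_burmese_only(sentence, allowed=" "):
    """True if the sentence contains Myanmar-block characters and nothing else
    apart from the explicitly allowed characters (spaces by default)."""
    has_burmese = False
    for ch in sentence:
        if is_burmese_char(ch):
            has_burmese = True
        elif ch not in allowed:
            return False
    return has_burmese


def stream_burmese_sentences(lines, limit):
    """Lazily yield up to `limit` Burmese-only sentences from an iterable of lines."""
    kept = 0
    for line in lines:
        if kept >= limit:
            break
        sentence = line.strip()
        if is_burmese_only(sentence):
            kept += 1
            yield sentence


corpus = ["မြန်မာစာ", "hello", "မြန်မာ BPE", "ဘာသာစကား"]
# Keeps the first and last entries; the other two contain Latin letters.
print(list(stream_burmese_sentences(corpus, limit=2)))
```

Because `stream_burmese_sentences` is a generator, `lines` can be an open file handle or any other iterator, and a limit such as 1,500,000 can be applied without loading everything first.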

	## ⚠️ Important Considerations (Limitations)

	* English Language Weakness: Because this model was trained purely on Burmese data, it handles English poorly and often fragments Latin-script text down to individual characters.
	* BPE Nature: This BPE version segments text differently from our Unigram models, which may affect some downstream NLP tasks. Choose the tokenizer that suits your task.
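
One simple way to quantify the fragmentation described above is the average number of characters covered by each piece: a value near 1.0 means the text is being split almost character by character. The helper below is a hypothetical sketch that accepts any function returning a list of pieces; the two stand-in tokenizers exist only to make the example self-contained and are not part of this project.

```python
def chars_per_piece(encode, text):
    """Average number of non-space characters per emitted piece.

    `encode` is any callable that maps a string to a list of pieces.
    """
    pieces = encode(text)
    if not pieces:
        return 0.0
    n_chars = sum(1 for ch in text if not ch.isspace())
    return n_chars / len(pieces)


def char_level(text):
    """Stand-in tokenizer: worst case, one piece per character."""
    return [ch for ch in text if not ch.isspace()]


def word_level(text):
    """Stand-in tokenizer: one piece per whitespace-separated word."""
    return text.split()


print(chars_per_piece(char_level, "hello world"))  # 1.0
print(chars_per_piece(word_level, "hello world"))  # 5.0
```

With the real tokenizer, passing `sp.encode_as_pieces` as `encode` and comparing a Burmese sample against an English sample gives a rough measure of the gap. The ratio is approximate, since pieces may include whitespace markers, and byte-fallback pieces can push it below 1.0.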

	## Citation

	If you use this tokenizer in your research or project, please cite it as follows:

	### APA 7th Edition
	Khant Sint Heinn. (2026). myX-Tokenizer-BPE: Byte Pair Encoding Baseline for Burmese (Version 1.0) [Computer software]. Hugging Face. https://huggingface.co/DatarrX/myX-Tokenizer-BPE

	### BibTeX
	```BibTeX
	@software{khantsintheinn2026bpe,
	author = {Khant Sint Heinn},
	title = {myX-Tokenizer-BPE: Byte Pair Encoding Baseline for Burmese},
	version = {1.0},
	year = {2026},
	publisher = {Hugging Face},
	url = {https://huggingface.co/DatarrX/myX-Tokenizer-BPE},
	note = {BPE algorithm based on Burmese raw data}
	}
	```

	---

	# DatarrX - myX-Tokenizer-BPE (Burmese) ⚙️

	myX-Tokenizer-BPE is a tokenizer built specifically for the Burmese language using the Byte Pair Encoding (BPE) algorithm. The model is published by [DatarrX (Myanmar Open Source NGO)](https://huggingface.co/DatarrX) and was created primarily by [Khant Sint Heinn (Kalix Louis)](https://huggingface.co/kalixlouiis).

	## 🎯 Objectives and Distinctive Features

	* BPE Basis: Intended to serve as a standard for segmenting Burmese text with BPE.
	* Burmese Only: Because this model was trained solely on Burmese text, it is specialized for Burmese (Myanmar) writing.
	* High-Quality Training: Built from 1.5 million sentences using a RAM-efficient method.

	## 🛠️ Technical Information

	* Algorithm: Byte Pair Encoding (BPE).
	* Vocab Size: 64,000.
	* Normalization: NFKC.
	* Features: Includes Byte-fallback, Split Digits, and Dummy Prefix.

	### Dataset Used
	Uses 1.5 million cleaned Burmese sentences from [kalixlouiis/raw-data](https://huggingface.co/datasets/kalixlouiis/raw-data).

	## ⚠️ Limitations to Be Aware Of

	* English Weakness: Because this model was trained solely on Burmese text, it is very weak at segmenting English words and tends to split them into individual characters.
	* Nature of BPE: Segmentation may differ from that of our Unigram models, so choose according to the task you intend to use it for.

	---

	## 💻 How to Use

	```python
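	# Requires: pip install sentencepiece huggingface_hub
	import sentencepiece as spm
	from huggingface_hub import hf_hub_download

	# Download the SentencePiece model file from the Hub (cached after the first call).
	model_path = hf_hub_download(repo_id="DatarrX/myX-Tokenizer-BPE", filename="myX-Tokenizer.model")
	sp = spm.SentencePieceProcessor(model_file=model_path)

	# Sample sentence: "Trying out segmenting Burmese text with the BPE algorithm."
	text = "မြန်မာစာကို BPE algorithm နဲ့ ဖြတ်တောက်ကြည့်ခြင်း။"
	print(sp.encode_as_pieces(text))  # subword pieces as strings
	ids = sp.encode(text, out_type=int)  # the same segmentation as vocabulary IDs
	print(sp.decode(ids))  # decode the IDs back to (normalized) text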
	```

	## ✍️ Project Authors
	- Developer: [Khant Sint Heinn (Kalix Louis)](https://huggingface.co/kalixlouiis)
	- Organization: [DatarrX (Myanmar Open Source NGO)](https://huggingface.co/DatarrX)

	## License 📜

	This project is licensed under the Apache License 2.0.

	### What does this mean?
	The Apache License 2.0 is a permissive license that allows you to:

	* Commercial Use: You can use this tokenizer for commercial purposes.
	* Modification: You can modify the model or the code for your specific needs.
	* Distribution: You can share and distribute the original or modified versions.
	* Sublicensing: You can grant sublicenses to others.

	### Conditions:
	* Attribution: You must give appropriate credit to the author (Khant Sint Heinn) and the organization (DatarrX).
	* License Notice: You must include a copy of the license and any original copyright notice in your distribution.

	For more details, you can read the full license text at [http://www.apache.org/licenses/LICENSE-2.0](http://www.apache.org/licenses/LICENSE-2.0).