---
library_name: pytorch
tags:
- hungarian
- transformer
- encoder
- tokenization-free
- character-based
- glass-box
license: cc-by-sa-4.0
datasets:
- webkorpusz-2.0
metrics:
- pos-accuracy
- word-reconstruction
---

# 🧠 HuBrain: Tokenization-free Hungarian Semantic Encoder / Tokenizáció-mentes magyar szemantikai encoder

[English Version](#english) | [Magyar Változat](#magyar)

🔗 **GitHub Repository:** [https://github.com/BraienStorm/hubrain-encoder](https://github.com/BraienStorm/hubrain-encoder)

---

<a name="english"></a>
## 🌍 English Description

🔗 **Source Code (GitHub):** [https://github.com/BraienStorm/hubrain-encoder](https://github.com/BraienStorm/hubrain-encoder)

HuBrain is an experimental, character-based **Glass-Box Semantic Encoder** designed to model the morphological richness and semantic relationships of the Hungarian language without traditional tokenization (e.g., BPE).

### 🚀 Live Visualization
View a PCA/t-SNE projection of the 1280-dimensional semantic latent space here:
👉 **[HuBrain Projector Visualization](https://projector.tensorflow.org/?config=https://jevcsak.hu/model/hubrain.json)**

![Latent Space Projection](latens_space.png)
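
The projector link above loads its data through the `config` URL parameter, which points at a JSON file listing where the tensor and metadata files are hosted. The following is a minimal sketch of how such files could be produced, assuming the embeddings are already available as plain lists of floats. It is not the repository's `export_projector.py`; the function name `export_for_projector`, the output file names, and the `base_url` argument are all hypothetical.

```python
import json
from pathlib import Path


def export_for_projector(words, vectors, out_dir, base_url):
    """Write vectors.tsv, metadata.tsv and a projector config JSON.

    words    -- list of labels, one per vector
    vectors  -- list of equal-length lists of floats
    base_url -- public URL prefix where the files will be hosted (hypothetical)
    """
    if not vectors or len(words) != len(vectors):
        raise ValueError("need a non-empty list with exactly one label per vector")

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    # One tab-separated row of floats per embedding.
    with open(out / "vectors.tsv", "w", encoding="utf-8") as f:
        for vec in vectors:
            f.write("\t".join(f"{x:.6f}" for x in vec) + "\n")

    # Single-column metadata is written without a header row.
    with open(out / "metadata.tsv", "w", encoding="utf-8") as f:
        for word in words:
            f.write(word + "\n")

    # Config consumed by the projector's ?config=... URL parameter.
    config = {
        "embeddings": [
            {
                "tensorName": "HuBrain semantic space",
                "tensorShape": [len(vectors), len(vectors[0])],
                "tensorPath": f"{base_url}/vectors.tsv",
                "metadataPath": f"{base_url}/metadata.tsv",
            }
        ]
    }
    with open(out / "projector_config.json", "w", encoding="utf-8") as f:
        json.dump(config, f, ensure_ascii=False, indent=2)
    return config
```

The three output files would then need to be uploaded to a publicly reachable location matching `base_url` before the projector can load them.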

### 📈 Training Progress (latest logs)
The model is currently in **Phase 2 (Joint Training)**. Recent logs show stable training and early signs of emergent factual knowledge:
- **POS Accuracy (Pm):** ~91.5% to 97.1%
- **Word Reconstruction (Wm):** ~30% to 74% (still emerging)
- **Latent Stability (Mag):** ~100 (balanced vector magnitude)
- **Learning Rate:** 2.4e-05

### 🛠️ Technical Specifications
- **Architecture:** Transformer encoder with RoPE (rotary position embedding) support.
- **Dimensions:** 1536 (256 anchor + 1280 semantic context).
- **Layers:** 18 layers, 24 attention heads.
- **Input:** Raw characters (fixed word length of 64 characters).
- **Vocabulary:** No out-of-vocabulary (OOV) issues, thanks to character-level coverage.
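
To make the input and dimension figures above concrete, here is a small sketch of fixed-length character encoding and of splitting a 1536-dimensional vector into its two parts. Several details are assumptions and not taken from the HuBrain source: characters are mapped to Unicode code points, code point 0 is used as padding, overlong words are truncated, and the 256 anchor dimensions are assumed to come first in the vector. The helper names are hypothetical.

```python
WORD_LEN = 64        # fixed word length from the specification above
ANCHOR_DIM = 256     # anchor part of the vector
SEMANTIC_DIM = 1280  # semantic context part of the vector
MODEL_DIM = ANCHOR_DIM + SEMANTIC_DIM  # 1536 in total
PAD = 0              # assumption: code point 0 marks padding


def encode_word(word, word_len=WORD_LEN):
    """Map a word to a fixed-length list of Unicode code points."""
    codes = [ord(ch) for ch in word[:word_len]]    # truncate overlong words
    return codes + [PAD] * (word_len - len(codes))  # right-pad short words


def decode_word(codes):
    """Invert encode_word by dropping padding positions."""
    return "".join(chr(c) for c in codes if c != PAD)


def split_vector(vec):
    """Split a 1536-dim vector into (anchor, semantic) parts.

    Assumes the anchor dimensions come first; the real layout may differ.
    """
    if len(vec) != MODEL_DIM:
        raise ValueError(f"expected {MODEL_DIM} dimensions, got {len(vec)}")
    return vec[:ANCHOR_DIM], vec[ANCHOR_DIM:]
```

Because every Hungarian character, including accented ones such as `ő` and `ű`, has its own code point, a scheme like this cannot produce an out-of-vocabulary word, which is the property the specification refers to.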

### 📥 Model Download
The model weight files (`.pth`, 6.7 GB in total) are hosted on Hugging Face because of their size. You can download them with the following command:

```bash
# Required: pip install huggingface_hub
python download_model.py
```
Or manually from: [https://huggingface.co/Braien/HuBrain-Encoder](https://huggingface.co/Braien/HuBrain-Encoder)
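
If you prefer to fetch the weights from your own code, the sketch below uses `snapshot_download` from `huggingface_hub` with the repository ID taken from the link above. It is an illustration only and may differ from what `download_model.py` does; the function name `download_weights`, the default target directory, and the `*.pth` filter are assumptions.

```python
REPO_ID = "Braien/HuBrain-Encoder"  # repository ID from the link above


def download_weights(local_dir="hubrain_weights", patterns=("*.pth",)):
    """Download the weight files into local_dir and return the local path.

    The directory name and the "*.pth" filter are illustrative assumptions.
    """
    # Imported lazily so this module loads even before the package is installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id=REPO_ID,
        local_dir=local_dir,
        allow_patterns=list(patterns),  # skip files that are not weights
    )

# Usage (downloads roughly 6.7 GB):
#   path = download_weights()
```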

### ⚙️ Requirements
```bash
pip install torch numpy huggingface_hub
```

### 🧪 Diagnostic Tools
- **`test_mask_prediction.py`**: Recovers a masked word from its context.
- **`test_analogy.py`**: Tests semantic analogies (e.g., king - man + woman).
- **`export_projector.py`**: Exports embeddings to the TF Projector format.
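
The analogy test mentioned above is commonly implemented as vector arithmetic followed by a nearest-neighbour search. The sketch below shows that general technique on hand-made toy vectors; it does not reproduce `test_analogy.py`, the three-dimensional embeddings are invented for illustration, and the function names are hypothetical. With the real model, the vectors would come from the encoder's semantic output.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def analogy(embeddings, a, b, c, top_k=1):
    """Return the words closest to vec(a) - vec(b) + vec(c).

    The three query words are excluded from the candidates.
    """
    target = [x - y + z
              for x, y, z in zip(embeddings[a], embeddings[b], embeddings[c])]
    scored = [(cosine(target, vec), word)
              for word, vec in embeddings.items()
              if word not in (a, b, c)]
    scored.sort(reverse=True)  # highest similarity first
    return [word for _, word in scored[:top_k]]


# Toy vectors with axes (royal, male, female); invented for illustration only.
toy = {
    "király":   [1.0, 1.0, 0.0],  # king
    "férfi":    [0.0, 1.0, 0.0],  # man
    "nő":       [0.0, 0.0, 1.0],  # woman
    "királynő": [1.0, 0.0, 1.0],  # queen
    "alma":     [0.1, 0.1, 0.1],  # apple (distractor)
}
print(analogy(toy, "király", "férfi", "nő"))  # → ['királynő']
```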

### ⚖️ Licensing & Data Sources
This model was trained using the **Webkorpusz 2.0** dataset. By using this model, you agree to comply with the following licenses:
- **Common Crawl subcorpus**: Used under the same terms as [Common Crawl](https://commoncrawl.org/terms-of-use/) itself.
- **Wikipedia subcorpus & processed data**: Licensed under **Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)**.
- **Disclaimer**: The training data originates from automated web crawling; the model creator assumes no responsibility for the content.

---

<a name="magyar"></a>
## 🇭🇺 Magyar leírás

🔗 **Forráskód (GitHub):** [https://github.com/BraienStorm/hubrain-encoder](https://github.com/BraienStorm/hubrain-encoder)

A HuBrain egy kísérleti, karakter-alapú **Glass-Box Szemantikai Encoder**, amely a magyar nyelv morfológiai gazdagságát és szemantikai összefüggéseit modellezi hagyományos tokenizáció (pl. BPE) használata nélkül.

### 🚀 Élő Vizualizáció
A modell látens terének 1280 dimenziós szemantikai leképzése megtekinthető itt:
👉 **[HuBrain Projector Visualization](https://projector.tensorflow.org/?config=https://jevcsak.hu/model/hubrain.json)**

![Látens tér projekció](latens_space.png)

### 📈 Tréning Állapot (utolsó logok)
A modell jelenleg a **Phase 2 (Joint Training)** fázisban van. Az utolsó logok stabil tanulást és kialakuló tudást mutatnak:
- **POS Pontosság (Pm):** ~91.5% - 97.1%
- **Szó Rekonstrukció (Wm):** ~30% - 74% (Folyamatosan javul)
- **Látens Stabilitás (Mag):** ~100 (Kiegyensúlyozott vektor magnitúdó)
- **Tanulási ráta:** 2.4e-05

### 🛠️ Technikai adatok
- **Architektúra:** Transformer Encoder RoPE támogatással.
- **Dimenziók:** 1536 (256 Horgony + 1280 Szemantikai kontextus).
- **Rétegszám:** 18 réteg, 24 fej.
- **Bemenet:** Nyers karakterek (64 karakteres fix szóhossz).
- **Vocab:** Nincs OOV (szótáron kívüli szó) probléma a karakter-szintű lefedettség miatt.

### 📥 Modell letöltése
A nagyméretű modellfájlok (`.pth`, összesen 6.7 GB) a Hugging Face-en tárolódnak. Az alábbi parancs futtatásával töltheted le őket:

```bash
# Szükséges: pip install huggingface_hub
python download_model.py
```
Vagy manuálisan innen: [https://huggingface.co/Braien/HuBrain-Encoder](https://huggingface.co/Braien/HuBrain-Encoder)

### ⚙️ Követelmények
```bash
pip install torch numpy huggingface_hub
```

### 🧪 Diagnosztikai eszközök
- **`test_mask_prediction.py`**: Környezet alapú szó-visszafejtés.
- **`test_analogy.py`**: Szemantikai analógiák (pl. király - férfi + nő).
- **`export_projector.py`**: Exportálás TF Projector vizualizációhoz.

### ⚖️ Licenc és Adatforrások
A modell tanításához a **Webkorpusz 2.0** adatbázist használtuk fel. A modell használatával Ön elfogadja az alábbi licencfeltételeket:
- **Common Crawl alkorpusz**: A [Common Crawl](https://commoncrawl.org/terms-of-use/) saját felhasználási feltételei szerint került felhasználásra.
- **Wikipedia alkorpusz és feldolgozott adatok**: A **Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)** licenc alá tartoznak.
- **Felelősségkizárás**: Az adatok automatizált webes gyűjtésből származnak, a tartalmukért a modell készítője nem vállal felelősséget.