🧠 HuBrain: Tokenization-free Hungarian Semantic Encoder

🔗 GitHub Repository: https://github.com/BraienStorm/hubrain-encoder

🌍 English Description

🔗 Source Code (GitHub): https://github.com/BraienStorm/hubrain-encoder

HuBrain is an experimental, character-based Glass-Box Semantic Encoder designed to model the morphological richness and semantic relationships of the Hungarian language without traditional tokenization (e.g., BPE).

🚀 Live Visualization

View a PCA/t-SNE projection of the 1280-dimensional semantic latent space here: 👉 HuBrain Projector Visualization
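For readers who want to reproduce a similar projection offline, the following is a minimal sketch of PCA via SVD in NumPy. It assumes the 1280 semantic dimensions are the last 1280 entries of each 1536-dimensional vector; the model card does not state the ordering, and the helper names (`semantic_part`, `pca_project`) and the random input are placeholders, not part of the repository.

```python
import numpy as np

ANCHOR_DIM = 256      # from the model card
SEMANTIC_DIM = 1280   # from the model card


def semantic_part(vectors: np.ndarray) -> np.ndarray:
    """Return the assumed semantic slice (last 1280 dims) of 1536-d vectors."""
    if vectors.shape[1] != ANCHOR_DIM + SEMANTIC_DIM:
        raise ValueError("expected 1536-dimensional vectors")
    return vectors[:, ANCHOR_DIM:]


def pca_project(x: np.ndarray, n_components: int = 3) -> np.ndarray:
    """Project the rows of x onto their top principal components via SVD."""
    centered = x - x.mean(axis=0, keepdims=True)
    # Rows of vt are principal directions, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T


rng = np.random.default_rng(0)
# Placeholder data standing in for real encoder outputs.
fake_vectors = rng.normal(size=(50, ANCHOR_DIM + SEMANTIC_DIM))
points_3d = pca_project(semantic_part(fake_vectors))
print(points_3d.shape)  # (50, 3)
```

With real encoder outputs in place of `fake_vectors`, the three columns of `points_3d` can be plotted directly as a 3-D scatter.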

📈 Training Progress (latest logs)

The model is currently in Phase 2 (Joint Training). Recent logs show stable training and the first signs of emerging factual knowledge:

POS Accuracy (Pm): ~91.5% - 97.1%
Word Reconstruction (Wm): ~30% - 74% (Emerging)
Latent Stability (Mag): ~100 (Balanced vector magnitude)
Learning Rate: 2.4e-05
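As a rough illustration of the magnitude figure, the sketch below computes the mean L2 norm of a batch of vectors. Reading "Mag" as the mean L2 norm over the full 1536-dimensional vector is an assumption; the logs do not define the metric. Under that reading, a magnitude of about 100 corresponds to a per-dimension scale of roughly 100 / sqrt(1536), or about 2.55.

```python
import numpy as np


def mean_magnitude(vectors: np.ndarray) -> float:
    """Mean L2 norm across the rows of a (batch, dim) array."""
    return float(np.linalg.norm(vectors, axis=1).mean())


dim = 1536                           # total vector size from the model card
per_dim_scale = 100 / np.sqrt(dim)   # about 2.55 under the assumption above

rng = np.random.default_rng(0)
# Synthetic Gaussian vectors, not model output.
fake = rng.normal(scale=per_dim_scale, size=(1000, dim))
print(mean_magnitude(fake))  # close to 100 by construction
```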

🛠️ Technical Specifications

Architecture: Transformer Encoder with RoPE support.
Dimensions: 1536 (256 Anchors + 1280 Semantic Context).
Layers: 18 layers, 24 attention heads.
Input: Raw characters (64-char fixed word length).
Vocabulary: No out-of-vocabulary (OOV) issues, thanks to character-level coverage.
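To make the input format concrete, here is a minimal sketch of turning a word into a fixed 64-slot character sequence. The model's actual character inventory, padding symbol, and handling of over-long words are not documented here, so the use of Unicode code points, `PAD_ID = 0`, and truncation are all assumptions made for illustration.

```python
MAX_WORD_LEN = 64  # fixed word length from the model card
PAD_ID = 0         # assumption: 0 marks padding


def encode_word(word: str, max_len: int = MAX_WORD_LEN) -> list[int]:
    """Map a word to a fixed-length list of Unicode code points, zero-padded."""
    ids = [ord(ch) for ch in word[:max_len]]  # truncate over-long words
    return ids + [PAD_ID] * (max_len - len(ids))


def decode_word(ids: list[int]) -> str:
    """Invert encode_word by dropping padding and mapping ids back to characters."""
    return "".join(chr(i) for i in ids if i != PAD_ID)


word = "megszentségteleníthetetlenségeskedéseitekért"
ids = encode_word(word)
print(len(ids))                  # 64
print(decode_word(ids) == word)  # True
```

Because every character maps to some id, even a long agglutinated form like the one above is representable without an unknown-token fallback, which is the property the "no OOV" line refers to.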

📥 Model Download

The model weight files (.pth) are stored on Hugging Face because of their size (6.7 GB). You can download them with the following command:

# Required: pip install huggingface_hub
python download_model.py

Or manually from: https://huggingface.co/Braien/HuBrain-Encoder
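If you prefer to fetch the files from your own script, a minimal sketch using `huggingface_hub.snapshot_download` might look like this. It is not necessarily what `download_model.py` does; the function name `download_weights` and the `hubrain_weights` target folder are hypothetical, and the `*.pth` filter assumes the weights use the extension mentioned above.

```python
REPO_ID = "Braien/HuBrain-Encoder"  # taken from the Hugging Face URL above


def download_weights(local_dir: str = "hubrain_weights") -> str:
    """Download the repository's .pth files and return the local folder path."""
    # Imported lazily so this module loads even if huggingface_hub is missing.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id=REPO_ID,
        allow_patterns=["*.pth"],  # assumption: only the weight files are needed
        local_dir=local_dir,       # hypothetical target folder
    )
```

Calling `download_weights()` starts the transfer, so make sure about 6.7 GB of disk space is available first.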

⚙️ Requirements

pip install torch numpy huggingface_hub

🧪 Diagnostic Tools

test_mask_prediction.py: Context-based word recovery.
test_analogy.py: Semantic analogies (e.g., King - Man + Woman).
export_projector.py: Export to TF Projector format.
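The analogy test can be pictured as vector arithmetic followed by a nearest-neighbour search. The sketch below shows that idea on hand-made 3-dimensional toy vectors; it is not the code in `test_analogy.py`, the vectors are not model output, and the `analogy` helper is hypothetical.

```python
import numpy as np


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two non-zero vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def analogy(vocab: dict[str, np.ndarray], a: str, b: str, c: str) -> str:
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = vocab[a] - vocab[b] + vocab[c]
    candidates = {w: v for w, v in vocab.items() if w not in (a, b, c)}
    return max(candidates, key=lambda w: cosine(candidates[w], target))


# Toy axes: (royalty, male, female). Hand-made values for illustration only.
toy = {
    "king": np.array([1.0, 1.0, 0.0]),
    "queen": np.array([1.0, 0.0, 1.0]),
    "man": np.array([0.0, 1.0, 0.0]),
    "woman": np.array([0.0, 0.0, 1.0]),
    "apple": np.array([0.1, 0.1, 0.1]),  # distractor
}
print(analogy(toy, "king", "man", "woman"))  # queen
```

With real embeddings, `vocab` would map Hungarian words to their encoder vectors, and excluding the three query words keeps the search from trivially returning one of them.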

⚖️ Licensing & Data Sources

This model was trained using the Webkorpusz 2.0 dataset. By using this model, you agree to comply with the following licenses:

Common Crawl subcorpus: Used under the same terms as Common Crawl itself.
Wikipedia subcorpus & processed data: Licensed under Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0).
Disclaimer: The training data originates from automated web crawling; the model creator assumes no responsibility for the content.

🇭🇺 Hungarian Description

🔗 Source Code (GitHub): https://github.com/BraienStorm/hubrain-encoder

HuBrain is an experimental, character-based Glass-Box Semantic Encoder that models the morphological richness and semantic relationships of the Hungarian language without traditional tokenization (e.g., BPE).

🚀 Live Visualization

The 1280-dimensional semantic projection of the model's latent space can be viewed here: 👉 HuBrain Projector Visualization

📈 Training Status (latest logs)

The model is currently in Phase 2 (Joint Training). The latest logs show stable learning and emerging knowledge:

POS Accuracy (Pm): ~91.5% - 97.1%
Word Reconstruction (Wm): ~30% - 74% (Continuously improving)
Latent Stability (Mag): ~100 (Balanced vector magnitude)
Learning Rate: 2.4e-05

🛠️ Technical Specifications

Architecture: Transformer Encoder with RoPE support.
Dimensions: 1536 (256 Anchors + 1280 Semantic Context).
Layers: 18 layers, 24 heads.
Input: Raw characters (64-character fixed word length).
Vocabulary: No OOV (out-of-vocabulary) problem, thanks to character-level coverage.

📥 Model Download

The large model files (.pth, 6.7 GB in total) are stored on Hugging Face. Run the following command to download them:

# Required: pip install huggingface_hub
python download_model.py

Or manually from: https://huggingface.co/Braien/HuBrain-Encoder

⚙️ Requirements

pip install torch numpy huggingface_hub

🧪 Diagnostic Tools

test_mask_prediction.py: Context-based word recovery.
test_analogy.py: Semantic analogies (e.g., king - man + woman).
export_projector.py: Export for TF Projector visualization.

⚖️ License & Data Sources

The Webkorpusz 2.0 dataset was used to train the model. By using the model, you accept the following license terms:

Common Crawl subcorpus: Used under Common Crawl's own terms of use.
Wikipedia subcorpus & processed data: Covered by the Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.
Disclaimer: The data originates from automated web collection; the model creator assumes no responsibility for its content.
