metadata
language:
- vi
license: mit
tags:
- dependency-parsing
- vietnamese
- nlp
- biaffine
datasets:
- undertheseanlp/UDD-1
library_name: underthesea
pipeline_tag: token-classification
Bamboo-1: Vietnamese Dependency Parser
A Vietnamese dependency parser trained on the UDD-1 dataset using the Biaffine architecture.
Overview
Bamboo-1 is a neural dependency parser for Vietnamese that uses:
- Architecture: Biaffine Dependency Parser (Dozat & Manning, 2017)
- Dataset: UDD-1 (Universal Dependency Dataset for Vietnamese)
- Features: Character-level LSTM embeddings
Installation
cd ~/projects/workspace_underthesea/bamboo-1
uv sync
Usage
Training
# Train with default parameters
uv run scripts/train.py
# Train with custom parameters
uv run scripts/train.py --output models/bamboo-1 --max-epochs 200 --feat char
# Train with BERT embeddings
uv run scripts/train.py --feat bert --bert vinai/phobert-base
# Train with Weights & Biases logging
uv run scripts/train.py --wandb
Evaluation
# Evaluate trained model
uv run scripts/evaluate.py --model models/bamboo-1
Prediction
# Interactive prediction
uv run scripts/predict.py --model models/bamboo-1
# Predict from file
uv run scripts/predict.py --model models/bamboo-1 --input input.txt --output output.conllu
Dataset
The UDD-1 dataset is downloaded automatically from the Hugging Face Hub:
- Source: undertheseanlp/UDD-1
- Train: 18,282 sentences
- Validation: 859 sentences
- Test: 859 sentences
- Format: Universal Dependencies (CoNLL-U)
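For readers new to the format, the following is a minimal sketch of reading CoNLL-U text in Python. The `parse_conllu` helper and the sample sentence with its annotations are made up for illustration and are not part of bamboo-1 or UDD-1. The sketch assumes every token line has all ten tab-separated columns with an integer HEAD, and it keeps only the ID, FORM, HEAD, and DEPREL columns.

```python
def parse_conllu(text):
    """Parse CoNLL-U text into a list of sentences.

    Each sentence is a list of (id, form, head, deprel) tuples.
    Hypothetical helper for illustration; not part of bamboo-1.
    """
    sentences, tokens = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # A blank line ends the current sentence.
            if tokens:
                sentences.append(tokens)
                tokens = []
            continue
        if line.startswith("#"):
            continue  # sentence-level comment, e.g. "# text = ..."
        cols = line.split("\t")
        if len(cols) != 10 or not cols[0].isdigit():
            continue  # skip multiword ranges (1-2) and empty nodes (1.1)
        # Columns: ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
        tokens.append((int(cols[0]), cols[1], int(cols[6]), cols[7]))
    if tokens:
        sentences.append(tokens)
    return sentences


# Made-up example sentence, not taken from UDD-1.
SAMPLE = (
    "# text = Tôi yêu Việt Nam\n"
    "1\tTôi\t_\tPRON\t_\t_\t2\tnsubj\t_\t_\n"
    "2\tyêu\t_\tVERB\t_\t_\t0\troot\t_\t_\n"
    "3\tViệt Nam\t_\tPROPN\t_\t_\t2\tobj\t_\t_\n"
)

for tok_id, form, head, deprel in parse_conllu(SAMPLE)[0]:
    print(tok_id, form, head, deprel)
```

A HEAD of 0 marks the root of the sentence; every other HEAD value is the ID of the token's parent.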
Model Architecture
Input: Vietnamese sentence
↓
Word Embeddings + Character LSTM Embeddings
↓
BiLSTM Encoder (3 layers, 400 hidden units)
↓
Biaffine Attention (Arc + Relation)
↓
Output: Dependency tree (head indices + relation labels)
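The arc-scoring step can be illustrated with a small pure-Python sketch. It is not the bamboo-1 implementation: the function names, the toy vectors, and the greedy decoding are assumptions made for illustration. It follows the biaffine form from Dozat & Manning (2017), where the score of head j for dependent i is `head[j]^T U dep[i] + head[j] . u`.

```python
def biaffine_arc_scores(dep, head, U, u):
    """Score every (dependent, head) pair.

    dep, head: one vector (list of floats) per token, with row 0
               standing for the artificial ROOT.
    U: d x d weight matrix; u: length-d head-prior vector.
    Returns scores[i][j] = head[j]^T U dep[i] + head[j] . u
    """
    d = len(u)
    scores = []
    for dep_i in dep:
        # U @ dep_i, computed once per dependent.
        Ud = [sum(U[r][c] * dep_i[c] for c in range(d)) for r in range(d)]
        row = [sum(h[r] * (Ud[r] + u[r]) for r in range(d)) for h in head]
        scores.append(row)
    return scores


def greedy_heads(scores):
    """Pick the highest-scoring head for each non-ROOT token.

    Greedy argmax is an illustrative simplification: it can produce
    cycles, which a maximum-spanning-tree decoder would avoid.
    """
    heads = []
    for i in range(1, len(scores)):
        candidates = [j for j in range(len(scores)) if j != i]
        heads.append(max(candidates, key=lambda j: scores[i][j]))
    return heads


# Toy 2-dimensional vectors for ROOT plus two tokens (made-up numbers).
dep = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
head = [[0.0, 1.0], [0.0, 0.0], [1.0, 0.0]]
U = [[1.0, 0.0], [0.0, 1.0]]  # identity matrix
u = [0.0, 0.0]

print(greedy_heads(biaffine_arc_scores(dep, head, U, u)))
```

With these toy values, token 1 attaches to token 2 and token 2 attaches to ROOT (index 0). A relation classifier of the same biaffine shape would then label each selected arc.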
Metrics
- UAS (Unlabeled Attachment Score): Percentage of tokens assigned the correct head
- LAS (Labeled Attachment Score): Percentage of tokens assigned both the correct head and the correct relation label
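Both scores can be computed from aligned gold and predicted (head, relation) pairs. The sketch below is a minimal illustration with made-up data; `attachment_scores` is a hypothetical helper, not the function used by `scripts/evaluate.py`, and it makes no assumption about how that script treats punctuation.

```python
def attachment_scores(gold, pred):
    """Return (UAS, LAS) as percentages.

    gold, pred: aligned lists of (head, deprel) pairs, one per token.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same number of tokens")
    if not gold:
        raise ValueError("cannot score an empty token list")
    # UAS counts matching heads; LAS requires head and label to match.
    correct_heads = sum(g[0] == p[0] for g, p in zip(gold, pred))
    correct_both = sum(g == p for g, p in zip(gold, pred))
    total = len(gold)
    return 100.0 * correct_heads / total, 100.0 * correct_both / total


# Made-up four-token example.
gold = [(2, "nsubj"), (0, "root"), (2, "obj"), (3, "nmod")]
pred = [(2, "nsubj"), (0, "root"), (2, "nmod"), (2, "nmod")]
uas, las = attachment_scores(gold, pred)
print(f"UAS = {uas:.1f}%, LAS = {las:.1f}%")
```

Here three of four heads are correct (UAS 75%), but only two tokens also carry the correct label (LAS 50%), so LAS can never exceed UAS.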
Project Structure
bamboo-1/
├── README.md
├── requirements.txt
├── scripts/
│   ├── train.py      # Training script
│   ├── evaluate.py   # Evaluation script
│   └── predict.py    # Prediction script
├── bamboo1/
│   └── corpus.py     # UDD-1 corpus loader
├── models/           # Trained models (generated)
└── data/             # Downloaded dataset (generated)
References
- Dozat, T., & Manning, C. D. (2017). Deep Biaffine Attention for Neural Dependency Parsing. ICLR 2017.
License
MIT License