# XLM-RoBERTa-CRF-VotIE: Portuguese Voting Information Extraction
This model is a fine-tuned XLM-RoBERTa Base with a Conditional Random Fields (CRF) layer for extracting structured voting information from Portuguese municipal meeting minutes. It achieves state-of-the-art performance on the Citilink dataset.
## What's New in v2.0

- 4 new entity types: `COUNT-AGAINST`, `COUNT-BLANK`, `COUNT-FAVOR`, `VOTING-METHOD`
## Model Description
XLM-RoBERTa-CRF-VotIE combines the robust multilingual representations of Facebook AI's XLM-RoBERTa base model with a CRF layer for structured sequence prediction. The model performs token-level classification to identify and extract voting-related entities from Portuguese administrative text.
### Key Features
- Architecture: XLM-RoBERTa Base (768-dim, 12 layers) + Linear + CRF
- Task: Sequence Labeling with BIO tagging
- Language: Portuguese (Portugal)
- Domain: Municipal meeting minutes and voting records
- Entity Types: 12 types (24 labels with BIO encoding)
- Performance: 93.23% entity-level F1 score
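The CRF layer on top of the linear emission scores is what makes the prediction *structured*: instead of taking a per-token argmax, Viterbi decoding picks the globally best label sequence under learned transition scores. Below is a minimal pure-Python sketch with toy labels and hand-picked scores (not the model's actual parameters), showing how a transition penalty prevents an invalid `O → I-` sequence that greedy decoding would emit:

```python
def viterbi(emissions, transitions, labels):
    """Return the highest-scoring label sequence.

    emissions:   one dict per token mapping label -> emission score
    transitions: dict mapping (prev_label, label) -> transition score
    """
    scores = {lab: emissions[0][lab] for lab in labels}
    backpointers = []
    for emit in emissions[1:]:
        new_scores, bp = {}, {}
        for lab in labels:
            # Best previous label for `lab` at this position.
            prev = max(labels, key=lambda p: scores[p] + transitions[(p, lab)])
            new_scores[lab] = scores[prev] + transitions[(prev, lab)] + emit[lab]
            bp[lab] = prev
        backpointers.append(bp)
        scores = new_scores
    # Backtrack from the best final label.
    best = max(labels, key=scores.get)
    path = [best]
    for bp in reversed(backpointers):
        path.append(bp[path[-1]])
    return path[::-1]

# Toy 3-label tag set; a large penalty forbids I-X directly after O.
labels = ["O", "B-X", "I-X"]
transitions = {(p, l): 0.0 for p in labels for l in labels}
transitions[("O", "I-X")] = -10.0
emissions = [{"O": 1.0, "B-X": 0.0, "I-X": 0.0},
             {"O": 0.0, "B-X": 0.5, "I-X": 2.0}]

# A greedy per-token argmax would emit the invalid ["O", "I-X"];
# Viterbi with the transition penalty recovers a valid sequence.
print(viterbi(emissions, transitions, labels))  # ['B-X', 'I-X']
```

The real model learns these transition scores during training; the toy values above only illustrate the mechanism.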
## Intended Uses
This model is designed for:
- Extracting voting information from Portuguese municipal documents
- Identifying participants and their voting positions (favor, against, abstention, absent)
- Recognizing voting subjects and counting methods
- Structuring unstructured administrative text
- Research in information extraction from Portuguese administrative documents
## Entity Types
The model recognizes 12 entity types in BIO format (24 labels total):
| Entity Type | Description | Example |
|---|---|---|
| VOTER-FAVOR | Participants who voted in favor | "The Municipal Executive" |
| VOTER-AGAINST | Participants who voted against | "João Silva" |
| VOTER-ABSTENTION | Participants who abstained | "The councilor from PS" |
| VOTER-ABSENT | Participants who were absent | "Ana Simões" |
| VOTING | Voting action expressions | "deliberado", "aprovado" |
| SUBJECT | The subject matter being voted on | "budget changes" |
| COUNTING-UNANIMITY | Unanimous vote indicators | "unanimously" |
| COUNTING-MAJORITY | Majority vote indicators | "by majority" |
| COUNT-FAVOR | Numeric count of votes in favor | "5 votes in favor" |
| COUNT-AGAINST | Numeric count of votes against | "3 votes against" |
| COUNT-BLANK | Numeric count of blank votes | "0 blank votes" |
| VOTING-METHOD | Method of voting | "by secret scrutiny" |
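The 24 BIO labels follow mechanically from the 12 entity types above: each type gets a `B-` (begin) and an `I-` (inside) label, plus the shared `O` tag for tokens outside any entity. A quick sketch (the ordering here is illustrative; the checkpoint's actual `id2label` mapping may list them differently):

```python
ENTITY_TYPES = [
    "VOTER-FAVOR", "VOTER-AGAINST", "VOTER-ABSTENTION", "VOTER-ABSENT",
    "VOTING", "SUBJECT", "COUNTING-UNANIMITY", "COUNTING-MAJORITY",
    "COUNT-FAVOR", "COUNT-AGAINST", "COUNT-BLANK", "VOTING-METHOD",
]

# Each entity type yields a B- (begin) and an I- (inside) label;
# "O" marks tokens outside any entity.
bio_labels = [f"{prefix}-{t}" for t in ENTITY_TYPES for prefix in ("B", "I")]
label_set = ["O"] + bio_labels

print(len(bio_labels))  # 24 entity labels
print(len(label_set))   # 25 including "O"
```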
## Training Details

### Training Data
The model was trained on the Citilink dataset (https://rdm.inesctec.pt/dataset/cs-2025-007), which consists of Portuguese municipal meeting minutes annotated with voting information:
- Training set: 1,737 examples
- Validation set: 433 examples
- Test set: 529 examples
- Total tokens: ~300K tokens
- Total entities: ~5K entities
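For reference, the counts above correspond to roughly a 64/16/20 train/validation/test split:

```python
# Split sizes from the Citilink dataset statistics above.
splits = {"train": 1737, "validation": 433, "test": 529}
total = sum(splits.values())  # 2,699 examples

for name, n in splits.items():
    print(f"{name}: {n / total:.1%}")
# train: 64.4%, validation: 16.0%, test: 19.6%
```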
### Training Procedure

**Hyperparameters:**
- Base model: `FacebookAI/xlm-roberta-base`
- Batch size: 8
- Learning rate: 5e-5 (linear decay)
- Weight decay: 0.01
- Dropout: 0.1
- Max sequence length: 512 tokens
- Epochs: 10
- Optimizer: AdamW
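The linear-decay schedule above anneals the learning rate from 5e-5 down to zero over training. A minimal sketch of that schedule (assuming no warmup, which the card does not specify; the step count is derived from the batch size and epoch count above):

```python
BASE_LR = 5e-5

def linear_decay_lr(step, total_steps, base_lr=BASE_LR):
    """Linearly anneal the learning rate from base_lr down to 0."""
    remaining = max(0.0, 1.0 - step / total_steps)
    return base_lr * remaining

# With 1,737 training examples, batch size 8, and 10 epochs:
total_steps = (1737 // 8 + 1) * 10  # ~2,180 optimizer steps
print(linear_decay_lr(0, total_steps))            # 5e-05 at the start
print(linear_decay_lr(total_steps, total_steps))  # 0.0 at the end
```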
## Results

### Entity-Level Performance (Test Set)
| Metric | Score |
|---|---|
| F1 Score | 93.23% |
| Precision | 91.93% |
| Recall | 94.70% |
| Accuracy | 98.99% |
### Per-Entity Performance
| Entity Type | Precision | Recall | F1 Score | Support |
|---|---|---|---|---|
| COUNT-AGAINST | — | — | — | 0* |
| COUNT-BLANK | 66.67% | 80.00% | 72.73% | 5 |
| COUNT-FAVOR | 100.00% | 100.00% | 100.00% | 4 |
| COUNTING-MAJORITY | 98.33% | 100.00% | 99.16% | 59 |
| COUNTING-UNANIMITY | 99.39% | 99.69% | 99.54% | 326 |
| SUBJECT | 77.60% | 83.18% | 80.29% | 529 |
| VOTER-ABSENT | 83.33% | 83.33% | 83.33% | 18 |
| VOTER-ABSTENTION | 90.60% | 100.00% | 95.07% | 135 |
| VOTER-AGAINST | 100.00% | 100.00% | 100.00% | 36 |
| VOTER-FAVOR | 95.34% | 96.54% | 95.94% | 318 |
| VOTING | 100.00% | 98.98% | 99.49% | 489 |
| VOTING-METHOD | 100.00% | 100.00% | 100.00% | 5 |
* COUNT-AGAINST had no instances in the test set. The entity type is supported and trained but not evaluated on this split.
## Usage

### Quick Start
The simplest way to use the model:
```python
from transformers import AutoTokenizer, AutoModel

# Load model
model_name = "Anonymous3445/XLM-RoBERTa-CRF-VotIE"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name, trust_remote_code=True)

# Analyze text
text = "O Executivo deliberou aprovar o projeto por unanimidade."
inputs = tokenizer(text, return_tensors="pt")
predictions = model.decode(**inputs, tokenizer=tokenizer, text=text)

# Print results
for pred in predictions:
    print(f"{pred['word']:20} {pred['label']}")
```
Output:

```
O                    B-VOTER-FAVOR
Executivo            I-VOTER-FAVOR
deliberou            B-VOTING
aprovar              O
o                    O
projeto              O
por                  B-COUNTING-UNANIMITY
unanimidade.         I-COUNTING-UNANIMITY
```
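If you are working from flat word/label pairs like the output above, merging them into typed entity spans is a small post-processing step. Here is a sketch (the treatment of a stray `I-` label without a preceding `B-` as an entity start is an assumed convention, and the model's own `extract_entities` method already does this with character offsets):

```python
def bio_to_entities(preds):
    """Group (word, BIO-label) predictions into (entity_type, text) spans."""
    entities, current_type, current_words = [], None, []
    for p in preds:
        label = p["label"]
        if label.startswith("B-") or (label.startswith("I-") and label[2:] != current_type):
            # Flush any open entity, then start a new one
            # (a stray I- of a different type is treated as a begin).
            if current_type:
                entities.append((current_type, " ".join(current_words)))
            current_type, current_words = label[2:], [p["word"]]
        elif label.startswith("I-"):
            current_words.append(p["word"])
        else:  # "O" closes any open entity
            if current_type:
                entities.append((current_type, " ".join(current_words)))
            current_type, current_words = None, []
    if current_type:
        entities.append((current_type, " ".join(current_words)))
    return entities

preds = [
    {"word": "O", "label": "B-VOTER-FAVOR"},
    {"word": "Executivo", "label": "I-VOTER-FAVOR"},
    {"word": "deliberou", "label": "B-VOTING"},
    {"word": "aprovar", "label": "O"},
]
print(bio_to_entities(preds))
# [('VOTER-FAVOR', 'O Executivo'), ('VOTING', 'deliberou')]
```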
### Extract Structured Entities

The model includes a convenient `extract_entities` method:
```python
from transformers import AutoTokenizer, AutoModel

model_name = "Anonymous3445/XLM-RoBERTa-CRF-VotIE"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name, trust_remote_code=True)

text = """A Câmara Municipal deliberou aprovar a proposta apresentada pelo
Senhor Presidente. Votaram a favor os Senhores Vereadores João Silva e
Maria Costa. Votou contra o Senhor Vereador Pedro Santos."""

# Get structured entities with character offsets
entities = model.extract_entities(text, tokenizer)

# Print entities by type
for entity_type, mentions in entities.items():
    print(f"\n{entity_type}:")
    for mention in mentions:
        print(f"  - {mention['text']} [{mention['start']}:{mention['end']}]")
```
Output:

```
VOTER-FAVOR:
  - A Câmara Municipal [0:19]
  - João Silva [95:105]
  - Maria Costa [108:119]

VOTING:
  - deliberou [20:29]

VOTER-AGAINST:
  - Pedro Santos [152:164]
```
### Token-Level Predictions

For low-level token-based predictions:
```python
from transformers import AutoTokenizer, AutoModel

model_name = "Anonymous3445/XLM-RoBERTa-CRF-VotIE"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name, trust_remote_code=True)

text = "O Executivo deliberou aprovar por unanimidade."
inputs = tokenizer(text, return_tensors="pt")

# Get raw token-level predictions (list of label IDs)
predictions = model.decode(inputs["input_ids"], inputs["attention_mask"])
# Returns: [[0, 7, 15, 8, 0, 0, 2, 10, 0]]

# Convert to labels
id2label = model.config.id2label
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
for token, pred_id in zip(tokens, predictions[0]):
    if token not in ['<s>', '</s>', '<pad>']:
        print(f"{token:20} {id2label[pred_id]}")
```
### With Character Offsets

Useful for highlighting entities in your UI:
```python
from transformers import AutoTokenizer, AutoModel

model_name = "Anonymous3445/XLM-RoBERTa-CRF-VotIE"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name, trust_remote_code=True)

text = "O Executivo deliberou aprovar o projeto por unanimidade."
inputs = tokenizer(text, return_tensors="pt")

# Get predictions with character positions
predictions = model.decode(**inputs, tokenizer=tokenizer, text=text, return_offsets=True)

# Show only entities (non-O tags)
for pred in predictions:
    if pred['label'] != 'O':
        print(f"{pred['word']:20} {pred['label']:25} [{pred['start']}:{pred['end']}]")
```
Output:

```
O                    B-VOTER-FAVOR             [0:1]
Executivo            I-VOTER-FAVOR             [2:11]
deliberou            B-VOTING                  [12:21]
por                  B-COUNTING-UNANIMITY      [39:42]
unanimidade.         I-COUNTING-UNANIMITY      [43:55]
```
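The character offsets let you inject highlighting markup without re-tokenizing the text. A sketch that wraps each span in brackets, processing spans right-to-left so earlier offsets stay valid as the string grows (the merged spans here are illustrative, not the model's exact word-level output):

```python
def highlight(text, spans):
    """Wrap each (start, end, label) span as [text|label], right to left."""
    for start, end, label in sorted(spans, reverse=True):
        text = text[:start] + f"[{text[start:end]}|{label}]" + text[end:]
    return text

text = "O Executivo deliberou aprovar o projeto por unanimidade."
spans = [(0, 11, "VOTER-FAVOR"), (12, 21, "VOTING")]
print(highlight(text, spans))
# [O Executivo|VOTER-FAVOR] [deliberou|VOTING] aprovar o projeto por unanimidade.
```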
### Batch Processing

For processing multiple documents:
```python
from transformers import AutoTokenizer, AutoModel

model_name = "Anonymous3445/XLM-RoBERTa-CRF-VotIE"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name, trust_remote_code=True)

texts = [
    "A proposta foi aprovada por unanimidade.",
    "Votou contra o Vereador João Silva.",
    "O Presidente estava ausente na votação."
]

for text in texts:
    entities = model.extract_entities(text, tokenizer, return_offsets=False)
    print(f"\nText: {text}")
    for entity_type, mentions in entities.items():
        print(f"  {entity_type}: {[m['text'] for m in mentions]}")
```
## Limitations and Bias

### Limitations
- Domain-specific: Trained specifically on Portuguese municipal meeting minutes; may not generalize well to other document types
- Portuguese only: Optimized for European Portuguese
- Sequence length: Limited to 512 tokens per window (handles longer documents via windowing)
- Entity types: Limited to 12 predefined voting-related entity types
- Complex sentences: May struggle with highly complex or nested voting descriptions
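The windowing mentioned above slides overlapping 512-token windows over long documents so that every token appears inside at least one full context. A minimal chunking sketch (the overlap size of 128 tokens is an assumption, not the model's documented setting):

```python
def sliding_windows(tokens, max_len=512, stride=128):
    """Yield (start, window) chunks with `stride` tokens of overlap."""
    step = max_len - stride
    start = 0
    while True:
        yield start, tokens[start:start + max_len]
        if start + max_len >= len(tokens):
            break
        start += step

# A 1,000-token document fits in three overlapping windows.
tokens = [f"tok{i}" for i in range(1000)]
chunks = list(sliding_windows(tokens))
print(len(chunks))   # 3
print(chunks[1][0])  # second window starts at token 384
```

When merging predictions back together, a common choice is to keep each token's label from the window where it sits furthest from the edges.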
## Requirements

Install the required dependencies (version specifiers are quoted so the shell does not treat `>` as a redirect):

```bash
pip install "torch>=2.0.0" "transformers>=4.30.0" "pytorch-crf>=0.7.2"
```
## Model Card Authors
- Anonymous Authors (for blind review)
## Model Card Contact
For questions or issues, please open an issue in the GitHub repository.
## Additional Resources
- GitHub Repository: https://github.com/Anonymous3445/VotIE
- Dataset: [Citilink dataset](https://rdm.inesctec.pt/dataset/cs-2025-007)
- Paper: [Coming soon]
- Demo: VotIE Demo
## License
This model is released under the Creative Commons Attribution-NoDerivatives 4.0 International (CC BY-ND 4.0) license.
- ✅ You can: Use the model for research and commercial purposes with attribution
- ❌ You cannot: Create derivative works or modified versions
- 📝 You must: Provide attribution to the original authors
See LICENSE for full details.
## Acknowledgments
This work builds upon:
- XLM-RoBERTa: Facebook AI's XLM-RoBERTa multilingual base model
- pytorch-crf: CRF implementation
- Transformers: Hugging Face Transformers library
Version: 2.0
Last Updated: 2026-04-02