eu-pii-anonimization (Polish, mirror)
This is a redistribution mirror of bardsai/eu-pii-anonimization, the Polish-only variant of the bards.ai PII detection model family. All weights, tokenizer, and configuration files are byte-identical to the original release by bards.ai, used here under its Apache-2.0 license.
This mirror exists so that downstream applications continue to function if the upstream repository becomes unavailable. All credit for training and evaluating this model belongs to bards.ai — please refer to the original repository when accessible.
If you are the original author and would like changes (additional attribution, takedown, etc.), please open a discussion or contact wjarka on Hugging Face.
Polish PII and Sensitive Data Detection
bardsai/eu-pii-anonimization is a token classification model for detecting personally identifiable information (PII) and other regulated or high-sensitivity entities in Polish-language text.
It is the Polish-specialized counterpart to bardsai/eu-pii-anonimization-multilang and shares the same XLM-RoBERTa-base architecture and 35-entity tagging schema.
Key Highlights
- Language: Polish
- Task: Token classification
- Base architecture: XLM-RoBERTa-base
- Entity schema: 35 sensitive-data classes (B-/I- labeling, plus O)
Intended Use
- PII redaction in Polish documents, legal filings, tickets, emails, chat logs
- Dataset sanitization before training, analytics, or sharing
- Compliance and governance pipelines for sensitive data handling
- Pre-ingestion filtering for search, retrieval, and RAG systems
Detected Entity Types
Sensitive entity families include:
- Personal identity and profile data (PERSON_NAME, DATE_OF_BIRTH, PESEL, …)
- Organization and institutional identifiers (NIP, KRS, REGON, …)
- Contact details and location data
- Technical and digital identifiers
- Financial and commercial information
- Official document references
- Health, biometric, and genetic data
- Special-category personal data
The full label list is defined in config.json (id2label and label2id).
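As a minimal sketch, the entity types can be recovered from an id2label mapping by stripping the B-/I- prefixes. The helper name entity_types and the five-entry mapping below are made up for illustration, and the exact label spellings (for example B-PESEL) are an assumption; the real mapping is whatever config.json contains, also available as model.config.id2label after loading.

```python
def entity_types(id2label):
    """Return the sorted entity types found in a BIO-style label mapping."""
    types = set()
    for label in id2label.values():
        if label == "O":
            continue  # "O" marks tokens outside any entity
        # Labels are assumed to look like "B-PESEL" / "I-PESEL".
        prefix, _, entity = label.partition("-")
        if prefix in ("B", "I") and entity:
            types.add(entity)
    return sorted(types)


# Hypothetical excerpt for illustration, NOT the real config.json contents.
sample_id2label = {
    0: "O",
    1: "B-PERSON_NAME",
    2: "I-PERSON_NAME",
    3: "B-PESEL",
    4: "I-PESEL",
}

print(entity_types(sample_id2label))
```

If every one of the 35 classes has both a B- and an I- tag, the full mapping would hold 2 × 35 + 1 = 71 labels; that count is an inference from the schema description and is worth checking against config.json.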
Quick Start
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

model_name = "wjarka/eu-pii-anonimization-pl"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

text = "Jan Kowalski, PESEL 80010112345, ul. Marszałkowska 12, Warszawa."
inputs = tokenizer(text, return_tensors="pt", truncation=True)

# Run inference without tracking gradients.
with torch.no_grad():
    outputs = model(**inputs)

# Pick the highest-scoring label ID for each token.
predictions = torch.argmax(outputs.logits, dim=-1)
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
labels = [model.config.id2label[p.item()] for p in predictions[0]]

# Print only the tokens tagged as sensitive.
for token, label in zip(tokens, labels):
    if label != "O":
        print(label, token)
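The loop above prints sub-word tokens one by one, whereas redaction needs character spans. The following is a rough sketch of one way to merge per-token B-/I- labels into spans and replace them with placeholders. The function names bio_to_spans and redact are hypothetical, and the labels and offsets in the example are hand-written for illustration rather than actual model output. With a fast tokenizer, offsets can be requested by passing return_offsets_mapping=True to the tokenizer call.

```python
def bio_to_spans(labels, offsets):
    """Merge per-token BIO labels into (start, end, entity_type) character spans."""
    spans = []
    current = None  # [start, end, entity_type] of the span being built
    for label, (start, end) in zip(labels, offsets):
        if start == end:
            continue  # special tokens such as <s> and </s> have empty offsets
        if label == "O":
            if current:
                spans.append(tuple(current))
                current = None
            continue
        prefix, _, entity = label.partition("-")
        if current and prefix == "I" and entity == current[2]:
            current[1] = end  # continue the open span
        else:
            # A B- tag, a type change, or a stray I- tag starts a new span.
            if current:
                spans.append(tuple(current))
            current = [start, end, entity]
    if current:
        spans.append(tuple(current))
    return spans


def redact(text, spans):
    """Replace each span with a [TYPE] placeholder, working right to left."""
    for start, end, entity in sorted(spans, reverse=True):
        text = text[:start] + f"[{entity}]" + text[end:]
    return text


# Hand-written example data (assumed label names, not real model output).
text = "Jan Kowalski, PESEL 80010112345."
labels = ["O", "B-PERSON_NAME", "I-PERSON_NAME", "O", "O", "B-PESEL", "O", "O"]
offsets = [(0, 0), (0, 3), (4, 12), (12, 13), (14, 19), (20, 31), (31, 32), (0, 0)]

spans = bio_to_spans(labels, offsets)
print(redact(text, spans))
```

Replacing from right to left keeps the earlier character offsets valid while the string changes length.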
For browser inference (Transformers.js / ONNX Runtime Web), the repository ships pre-exported ONNX weights:
import { pipeline } from '@huggingface/transformers';
const ner = await pipeline('token-classification', 'wjarka/eu-pii-anonimization-pl', { dtype: 'q8' });
const out = await ner('Jan Kowalski, PESEL 80010112345.');
console.log(out);
Repository Files
- config.json: model config and label mapping
- tokenizer.json, tokenizer_config.json: tokenizer assets
- onnx/model.onnx: exported ONNX model (fp32, ~1.1 GB)
- onnx/model_quantized.onnx: INT8 quantized ONNX model (~280 MB)
Limitations
- Performance can vary by domain, formatting quality, and OCR noise.
- Ambiguous phrases may require post-processing and human validation.
- The model should support compliance workflows, not replace legal decisions.
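One possible way to route uncertain predictions to a human reviewer is to flag tokens whose top label probability falls below a cutoff. This is only a sketch: the function name flag_low_confidence, the 0.8 threshold, and the two-label mapping are illustrative assumptions, not values recommended by the model authors. With the Quick Start code, the per-token logit rows could be obtained as outputs.logits[0].tolist().

```python
import math


def softmax(logits):
    """Convert one row of logits into probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def flag_low_confidence(token_logits, id2label, threshold=0.8):
    """Return (token_index, label, probability) where the top probability is below threshold."""
    flagged = []
    for index, logits in enumerate(token_logits):
        probs = softmax(logits)
        best = max(range(len(probs)), key=probs.__getitem__)
        if probs[best] < threshold:
            # Uncertain "O" predictions are flagged too: they may hide missed PII.
            flagged.append((index, id2label[best], probs[best]))
    return flagged


# Toy logits for three tokens over a hypothetical two-label schema.
toy_id2label = {0: "O", 1: "B-PESEL"}
toy_logits = [[4.0, 0.0], [0.2, 0.0], [0.0, 5.0]]

flagged = flag_low_confidence(toy_logits, toy_id2label)
print(flagged)
```

A suitable threshold depends on the domain and on how costly a missed entity is, so it would need tuning on representative documents.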
About bards.ai
At bards.ai, the original authors build practical ML systems for NLP, vision, and time series. More info: https://bards.ai