eu-pii-anonimization (Polish, mirror)

This is a redistribution mirror of bardsai/eu-pii-anonimization — the Polish-only variant of the bards.ai PII detection model family.

All weights, tokenizer, and configuration files are byte-identical to the original release by bards.ai, used here under its Apache-2.0 license.

This mirror exists so that downstream applications continue to function if the upstream repository becomes unavailable. All credit for training and evaluating this model belongs to bards.ai — please refer to the original repository when accessible.

If you are the original author and would like changes (additional attribution, takedown, etc.), please open a discussion or contact wjarka on Hugging Face.

Polish PII and Sensitive Data Detection

bardsai/eu-pii-anonimization is a token classification model for detecting personally identifiable information (PII) and other regulated or high-sensitivity entities in Polish-language text.

It is the Polish-specialized counterpart to bardsai/eu-pii-anonimization-multilang and shares the same XLM-RoBERTa-base architecture and 35-entity tagging schema.

Key Highlights

Language: Polish
Task: Token classification
Base architecture: XLM-RoBERTa-base
Entity schema: 35 sensitive-data classes (B-/I- labeling, plus O)

Intended Use

PII redaction in Polish documents, legal filings, tickets, emails, chat logs
Dataset sanitization before training, analytics, or sharing
Compliance and governance pipelines for sensitive data handling
Pre-ingestion filtering for search, retrieval, and RAG systems
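
As a rough illustration of the redaction use case, the sketch below replaces detected character spans with bracketed placeholders. It assumes the detections are already available as (start, end, label) tuples with an exclusive end offset. The `redact` helper and the hand-written spans are hypothetical and are not part of the original release.

```python
from typing import Iterable, Tuple

# (start, end, label) with an exclusive end offset; this layout is an assumption
Span = Tuple[int, int, str]


def redact(text: str, spans: Iterable[Span]) -> str:
    """Replace each character span with a [LABEL] placeholder."""
    out = text
    # Apply replacements right to left so earlier offsets stay valid.
    for start, end, label in sorted(spans, key=lambda s: s[0], reverse=True):
        out = out[:start] + f"[{label}]" + out[end:]
    return out


text = "Jan Kowalski, PESEL 80010112345."
# Hand-written spans for illustration only, not real model output
spans = [(0, 12, "PERSON_NAME"), (20, 31, "PESEL")]
redacted = redact(text, spans)
print(redacted)
```

The helper does not handle overlapping spans; a production pipeline would need to merge or prioritize them before redacting.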

Detected Entity Types

Sensitive entity families include:

Personal identity and profile data (PERSON_NAME, DATE_OF_BIRTH, PESEL, …)
Organization and institutional identifiers (NIP, KRS, REGON, …)
Contact details and location data
Technical and digital identifiers
Financial and commercial information
Official document references
Health, biometric, and genetic data
Special-category personal data

The full label list is defined in config.json (id2label and label2id).
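
To see which entity types a label mapping covers, the B-/I- prefixes can be stripped from the id2label values. The sketch below uses a tiny hypothetical excerpt of such a mapping; the index order and the `entity_types` helper are assumptions for illustration, and the real mapping should be read from config.json.

```python
def entity_types(id2label):
    """Return the sorted entity types with B-/I- prefixes stripped; O is skipped."""
    types = set()
    for label in id2label.values():
        if label == "O":
            continue
        prefix, _, name = label.partition("-")
        # Keep the raw label if it does not follow the B-/I- convention
        types.add(name if prefix in ("B", "I") and name else label)
    return sorted(types)


# Hypothetical excerpt; the real mapping lives in config.json
sample = {0: "O", 1: "B-PERSON_NAME", 2: "I-PERSON_NAME", 3: "B-PESEL", 4: "I-PESEL"}
types_found = entity_types(sample)
print(types_found)

# If every one of the 35 classes has both a B- and an I- tag, plus a single O tag:
expected_label_count = 2 * 35 + 1
print(expected_label_count)
```

With a loaded model, the same helper could be called as `entity_types(model.config.id2label)`.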

Quick Start

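from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

model_name = "wjarka/eu-pii-anonimization-pl"

# Load the tokenizer and the token classification model from the mirror
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

text = "Jan Kowalski, PESEL 80010112345, ul. Marszałkowska 12, Warszawa."
# truncation=True cuts input that exceeds the model's maximum sequence length
inputs = tokenizer(text, return_tensors="pt", truncation=True)

# Inference only, so gradient tracking is disabled
with torch.no_grad():
    outputs = model(**inputs)
    # Pick the highest-scoring label id for every token
    predictions = torch.argmax(outputs.logits, dim=-1)

# Map ids back to subword tokens and label names
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
labels = [model.config.id2label[p.item()] for p in predictions[0]]

# Print every token that was tagged as part of a sensitive entity
for token, label in zip(tokens, labels):
    if label != "O":
        print(label, token)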
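
The loop above prints individual subword tokens. A minimal sketch of merging them into whole entities follows, assuming SentencePiece-style tokens where "\u2581" marks the start of a word. The `group_entities` helper, the token split, and the labels in the example are hypothetical and were written by hand; they are not output from this model.

```python
SPECIAL_TOKENS = ("<s>", "</s>", "<pad>")


def group_entities(tokens, labels):
    """Merge subword tokens with B-/I- labels into (entity_type, text) pairs."""
    entities = []
    current_type, pieces = None, []

    def flush():
        nonlocal current_type, pieces
        if current_type is not None:
            # "\u2581" is the SentencePiece word-boundary marker
            text = "".join(pieces).replace("\u2581", " ").strip()
            entities.append((current_type, text))
        current_type, pieces = None, []

    for token, label in zip(tokens, labels):
        if label == "O" or token in SPECIAL_TOKENS:
            flush()
            continue
        prefix, _, name = label.partition("-")
        # A B- tag or a change of entity type starts a new entity
        if prefix == "B" or name != current_type:
            flush()
            current_type = name
        pieces.append(token)
    flush()
    return entities


# Illustrative tokens and labels, written by hand
tokens = ["<s>", "\u2581Jan", "\u2581Kowal", "ski", ",", "\u2581PESEL",
          "\u2581800", "101", "12345", ".", "</s>"]
labels = ["O", "B-PERSON_NAME", "I-PERSON_NAME", "I-PERSON_NAME", "O", "O",
          "B-PESEL", "I-PESEL", "I-PESEL", "O", "O"]
entities = group_entities(tokens, labels)
print(entities)
```

This sketch returns text only. When character offsets are needed, for example for redaction, the Transformers `pipeline("token-classification", ..., aggregation_strategy="simple")` is likely the more convenient route, since it returns grouped entities with `start` and `end` positions.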

For browser inference (Transformers.js / ONNX Runtime Web), the repository ships pre-exported ONNX weights:

import { pipeline } from '@huggingface/transformers';

const ner = await pipeline('token-classification', 'wjarka/eu-pii-anonimization-pl', { dtype: 'q8' });
const out = await ner('Jan Kowalski, PESEL 80010112345.');
console.log(out);

Repository Files

config.json — model config and label mapping
tokenizer.json, tokenizer_config.json — tokenizer assets
onnx/model.onnx — exported ONNX model (fp32, ~1.1 GB)
onnx/model_quantized.onnx — INT8 quantized ONNX model (~280 MB)
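
The two file sizes are roughly what one would expect from the parameter count alone. The back-of-the-envelope check below assumes about 278 million parameters for an XLM-RoBERTa-base sized model, which is an approximation and not a figure taken from this repository. It also ignores ONNX graph metadata and any tensors left unquantized.

```python
# Assumed approximate parameter count; not taken from this repository
PARAMS = 278_000_000


def size_mb(params: int, bytes_per_param: int) -> float:
    """Estimate raw weight size in megabytes (1 MB = 1,000,000 bytes)."""
    return params * bytes_per_param / 1_000_000


fp32_mb = size_mb(PARAMS, 4)  # 4 bytes per fp32 weight
int8_mb = size_mb(PARAMS, 1)  # 1 byte per INT8 weight
print(f"fp32: ~{fp32_mb:.0f} MB, int8: ~{int8_mb:.0f} MB")
```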

Limitations

Performance can vary by domain, formatting quality, and OCR noise.
Ambiguous phrases may require post-processing and human validation.
The model should support compliance workflows, not replace legal decisions.
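
One possible way to route ambiguous predictions to human validation is to flag tokens whose top softmax probability falls below a threshold. The sketch below is pure Python and works on plain lists of logits. The `route` helper, the 0.80 threshold, the three-label mapping, and the logit values are all hypothetical, and softmax scores are not calibrated probabilities.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route(token_logits, id2label, threshold=0.80):
    """Return (label, confidence, needs_review) for each token."""
    results = []
    for logits in token_logits:
        probs = softmax(logits)
        best = max(range(len(probs)), key=probs.__getitem__)
        # Low top-1 confidence marks the token for manual review
        results.append((id2label[best], probs[best], probs[best] < threshold))
    return results


# Hypothetical label mapping and logits for two tokens
id2label = {0: "O", 1: "B-PESEL", 2: "I-PESEL"}
token_logits = [[0.1, 4.0, 0.2], [1.0, 1.2, 0.9]]
results = route(token_logits, id2label)
for label, confidence, needs_review in results:
    print(label, round(confidence, 3), "review" if needs_review else "auto")
```

With the Quick Start code, `outputs.logits[0].tolist()` would provide the per-token logits in the list form this helper expects. A suitable threshold would have to be chosen from validation data for the target domain.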

About bards.ai

bards.ai, the original author of this model, builds practical ML systems for NLP, vision, and time series. More info: https://bards.ai
