eu-pii-anonimization (Polish, mirror)

This is a redistribution mirror of bardsai/eu-pii-anonimization — the Polish-only variant of the bards.ai PII detection model family.

All weights, tokenizer, and configuration files are byte-identical to the original release by bards.ai, used here under its Apache-2.0 license.

This mirror exists so that downstream applications continue to function if the upstream repository becomes unavailable. All credit for training and evaluating this model belongs to bards.ai — please refer to the original repository when accessible.

If you are the original author and would like changes (additional attribution, takedown, etc.), please open a discussion or contact wjarka on Hugging Face.


Polish PII and Sensitive Data Detection

bardsai/eu-pii-anonimization is a token classification model for detecting personally identifiable information (PII) and other regulated or high-sensitivity entities in Polish-language text.

It is the Polish-specialized counterpart to bardsai/eu-pii-anonimization-multilang and shares the same XLM-RoBERTa-base architecture and 35-entity tagging schema.

Key Highlights

  • Language: Polish
  • Task: Token classification
  • Base architecture: XLM-RoBERTa-base
  • Entity schema: 35 sensitive-data classes (B-/I- labeling, plus O)
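
As a rough illustration of the B-/I- scheme mentioned above: if every entity class gets one B- tag and one I- tag, and a single O tag covers everything else, a 35-class schema expands to 71 labels. The sketch below builds such a label list from three entity names that appear in this card. The helper name `build_bio_labels` and the label ordering are assumptions made for illustration; the authoritative list is the one in config.json.

```python
def build_bio_labels(entity_types):
    """Expand entity types into a BIO label list: O first, then B-/I- per type."""
    labels = ["O"]
    for entity in entity_types:
        labels.append(f"B-{entity}")  # first token of an entity
        labels.append(f"I-{entity}")  # continuation tokens of the same entity
    return labels


# Three entity names taken from this card; the ordering is illustrative only.
sample_types = ["PERSON_NAME", "DATE_OF_BIRTH", "PESEL"]
sample_labels = build_bio_labels(sample_types)
print(sample_labels)

# Under the same assumption, 35 classes give 2 * 35 + 1 labels.
print(len(build_bio_labels([f"TYPE_{i}" for i in range(35)])))
```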

Intended Use

  • PII redaction in Polish documents, legal filings, tickets, emails, chat logs
  • Dataset sanitization before training, analytics, or sharing
  • Compliance and governance pipelines for sensitive data handling
  • Pre-ingestion filtering for search, retrieval, and RAG systems
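
For the redaction use case, one possible approach is to replace each detected character span with a placeholder named after its entity type. The sketch below assumes detections arrive as dictionaries with `entity_group`, `start`, and `end` keys, the shape used by the aggregated output of the Transformers token-classification pipeline. The `redact` helper and the hand-written spans are hypothetical and are not output from this model.

```python
def redact(text, spans):
    """Replace each detected character span with a [LABEL] placeholder."""
    # Work from the end of the string so earlier offsets stay valid.
    for span in sorted(spans, key=lambda s: s["start"], reverse=True):
        placeholder = f"[{span['entity_group']}]"
        text = text[: span["start"]] + placeholder + text[span["end"]:]
    return text


text = "Jan Kowalski, PESEL 80010112345."
# Hand-written spans for illustration, not real model predictions.
spans = [
    {"entity_group": "PERSON_NAME", "start": 0, "end": 12},
    {"entity_group": "PESEL", "start": 20, "end": 31},
]
redacted = redact(text, spans)
print(redacted)  # [PERSON_NAME], PESEL [PESEL].
```

Replacing from right to left avoids recomputing offsets after each substitution, since placeholders rarely have the same length as the text they replace.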

Detected Entity Types

Sensitive entity families include:

  • Personal identity and profile data (PERSON_NAME, DATE_OF_BIRTH, PESEL, …)
  • Organization and institutional identifiers (NIP, KRS, REGON, …)
  • Contact details and location data
  • Technical and digital identifiers
  • Financial and commercial information
  • Official document references
  • Health, biometric, and genetic data
  • Special-category personal data

The full label list is defined in config.json (id2label and label2id).
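
To see which entity types a label mapping covers, the B-/I- prefixes can be stripped and the remaining names deduplicated. The sketch below is a minimal example of that idea. The `entity_types` helper and the five-entry mapping are hypothetical; with a loaded model, `model.config.id2label` could be passed instead.

```python
def entity_types(id2label):
    """Return the sorted entity types in a label mapping, without B-/I- prefixes."""
    types = set()
    for label in id2label.values():
        if label == "O":
            continue  # O marks non-sensitive tokens
        prefix, _, name = label.partition("-")
        # Keep the label unchanged if it does not follow the B-/I- convention.
        types.add(name if prefix in ("B", "I") else label)
    return sorted(types)


# Hypothetical excerpt; the real mapping lives in the repository's config.json.
sample_id2label = {0: "O", 1: "B-PESEL", 2: "I-PESEL", 3: "B-NIP", 4: "I-NIP"}
sample_types = entity_types(sample_id2label)
print(sample_types)  # ['NIP', 'PESEL']
```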

Quick Start

from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

model_name = "wjarka/eu-pii-anonimization-pl"

# Load the tokenizer and the token-classification model from the Hub.
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

text = "Jan Kowalski, PESEL 80010112345, ul. Marszałkowska 12, Warszawa."
inputs = tokenizer(text, return_tensors="pt", truncation=True)

# Run inference without tracking gradients and take the top label per token.
with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.argmax(outputs.logits, dim=-1)

# Map token ids and predicted class ids back to readable strings.
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
labels = [model.config.id2label[p.item()] for p in predictions[0]]

# Print every subword token that was tagged as sensitive.
for token, label in zip(tokens, labels):
    if label != "O":
        print(label, token)
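
The loop above prints one line per subword token. To work with whole entities, per-token labels can be merged into character spans. The sketch below is one way to do that, assuming per-token `(start, end)` offsets are available, for example by passing `return_offsets_mapping=True` to a fast tokenizer. The `merge_bio` helper and the sample tokenization are hypothetical and hand-written, not actual output of this model.

```python
def merge_bio(labels, offsets):
    """Merge per-token BIO labels and (start, end) offsets into entity spans."""
    spans = []
    current = None
    for label, (start, end) in zip(labels, offsets):
        if start == end:
            continue  # special tokens such as <s> and </s> have empty offsets
        if label == "O":
            current = None  # close any open span
            continue
        prefix, _, name = label.partition("-")
        if prefix == "I" and current is not None and current["entity"] == name:
            current["end"] = end  # extend the open span
        else:
            current = {"entity": name, "start": start, "end": end}
            spans.append(current)
    return spans


# Hand-written labels and offsets for "Jan Kowalski, PESEL 80010112345."
sample_labels = ["O", "B-PERSON_NAME", "I-PERSON_NAME", "O", "O",
                 "B-PESEL", "I-PESEL", "O", "O"]
sample_offsets = [(0, 0), (0, 3), (4, 12), (12, 13), (14, 19),
                  (20, 26), (26, 31), (31, 32), (0, 0)]
merged = merge_bio(sample_labels, sample_offsets)
print(merged)
```

The Transformers `pipeline("token-classification", ...)` helper offers a built-in alternative through its `aggregation_strategy` argument, which may be preferable when custom merging rules are not needed.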

For browser inference (Transformers.js / ONNX Runtime Web), the repository ships pre-exported ONNX weights:

import { pipeline } from '@huggingface/transformers';

const ner = await pipeline('token-classification', 'wjarka/eu-pii-anonimization-pl', { dtype: 'q8' });
const out = await ner('Jan Kowalski, PESEL 80010112345.');
console.log(out);

Repository Files

  • config.json — model config and label mapping
  • tokenizer.json, tokenizer_config.json — tokenizer assets
  • onnx/model.onnx — exported ONNX model (fp32, ~1.1 GB)
  • onnx/model_quantized.onnx — INT8 quantized ONNX model (~280 MB)
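
The two ONNX file sizes are roughly what a bytes-per-parameter estimate would predict. The sketch below assumes about 278 million parameters, a commonly cited figure for XLM-RoBERTa-base; the exact parameter count of this checkpoint is not stated in this card, so the numbers are an approximation only.

```python
# Assumption: about 278 million parameters (commonly cited for XLM-RoBERTa-base).
ASSUMED_PARAMS = 278_000_000


def size_gb(num_params, bytes_per_param):
    """Approximate weight size in decimal gigabytes."""
    return num_params * bytes_per_param / 1e9


fp32_gb = size_gb(ASSUMED_PARAMS, 4)  # 32-bit floats take 4 bytes each
int8_gb = size_gb(ASSUMED_PARAMS, 1)  # 8-bit integers take 1 byte each
print(f"fp32 ~ {fp32_gb:.2f} GB, int8 ~ {int8_gb:.2f} GB")
```

Real files also carry graph metadata, and a quantized export may keep some tensors in higher precision, so the sizes on disk will not match the estimate exactly.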

Limitations

  • Performance can vary by domain, formatting quality, and OCR noise.
  • Ambiguous phrases may require post-processing and human validation.
  • The model should support compliance workflows, not replace legal decisions.
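
One possible way to combine post-processing with human validation is to accept high-confidence detections automatically and route the rest to a reviewer. The sketch below assumes each detection carries a `score` key, as in Transformers pipeline output. The `triage` helper, the 0.9 threshold, and the sample scores are illustrative assumptions, not recommendations derived from this model's evaluation.

```python
def triage(entities, threshold=0.9):
    """Split detections into auto-accepted and needs-review lists by score."""
    accepted = [e for e in entities if e["score"] >= threshold]
    review = [e for e in entities if e["score"] < threshold]
    return accepted, review


# Made-up detections and scores for illustration, not real model output.
detections = [
    {"entity_group": "PERSON_NAME", "word": "Jan Kowalski", "score": 0.99},
    {"entity_group": "PESEL", "word": "80010112345", "score": 0.62},
]
accepted, review = triage(detections)
print("accepted:", [e["word"] for e in accepted])
print("review:", [e["word"] for e in review])
```

A suitable threshold depends on the domain and on the relative cost of missed detections versus false positives, so it is best tuned on representative data.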

About bards.ai

bards.ai, the original author of this model, builds practical ML systems for NLP, vision, and time series. More info: https://bards.ai
