Knowledge Platform NER

A cross-domain, multilingual Named Entity Recognition model built for the Knowledge Platform — a system that connects patents, scientific papers, news articles, and political documents across 13 data sources.

Fine-tuned from answerdotai/ModernBERT-base on 256K+ multilingual documents spanning patents (USPTO, EPO), scientific papers (OpenAlex, arXiv), political documents (Bundestag, EU Parliament), and news.

Key Results

Metric	Score
F1	90.6%
Precision	89.5%
Recall	91.8%
Accuracy	98.1%
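
As a quick consistency check on the table above, the sketch below recomputes F1 from the reported precision and recall. It assumes the reported F1 is the usual harmonic mean of the two; the function name `f1_score` is chosen here for illustration only.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Headline precision and recall from the Key Results table.
headline_f1 = f1_score(0.895, 0.918)
print(f"{headline_f1:.3f}")  # prints 0.906
```

The recomputed value matches the reported 90.6% F1 to three decimal places.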

Entity Types

The model recognizes 15 entity types using BIO tagging (31 labels total):

Tag	Entity Type	Example
`PER`	Person	James Chen, Lisa Paus, Yann LeCun
`ORG`	Organization	Samsung Electronics, Bundestag, OpenAI
`LOC`	Location	Seoul, Brüssel, New York
`ANIM`	Animal	E. coli, SARS-CoV-2
`BIO`	Biological	CRISPR-Cas9, mRNA
`CEL`	Celestial Body	Mars, Jupiter
`DIS`	Disease	Alzheimer's, sickle cell disease
`EVE`	Event	COP28, World Economic Forum
`FOOD`	Food	glyphosate, insulin
`INST`	Instrument	LiDAR, mass spectrometer
`MEDIA`	Media/Work	Nature, The Lancet
`MYTH`	Mythological	Apollo (program context)
`PLANT`	Plant	Arabidopsis, cannabis sativa
`TIME`	Time	Q3 2025, fiscal year 2024
`VEHI`	Vehicle	Falcon 9, Boeing 787
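
The label count follows from the tagging scheme: each of the 15 types gets a `B-` (begin) and an `I-` (inside) label, plus a single `O` label for non-entity tokens. The sketch below builds such a label list. The ordering and the `label2id` mapping are assumptions made for illustration; the authoritative mapping is whatever the model's own config stores.

```python
ENTITY_TYPES = [
    "PER", "ORG", "LOC", "ANIM", "BIO", "CEL", "DIS", "EVE",
    "FOOD", "INST", "MEDIA", "MYTH", "PLANT", "TIME", "VEHI",
]

def build_bio_labels(entity_types):
    """Return ['O', 'B-X', 'I-X', ...] for the given entity types."""
    labels = ["O"]
    for entity_type in entity_types:
        labels.append(f"B-{entity_type}")
        labels.append(f"I-{entity_type}")
    return labels

labels = build_bio_labels(ENTITY_TYPES)
# Hypothetical id assignment; the released model may order labels differently.
label2id = {label: idx for idx, label in enumerate(labels)}

print(len(labels))  # prints 31 (15 types x 2 prefixes + 'O')
```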

Use Cases

This model is designed for knowledge graph construction from heterogeneous document collections:

Patent Analysis: Extract assignees, inventors, locations, and technologies from patent filings
Scientific Literature: Identify authors, institutions, biological entities, and instruments from papers
Political Document Processing: Extract politicians, parties, and organizations from parliamentary debates (English and German)
News Processing: Identify key entities across news articles for event tracking
Cross-Domain Knowledge Graphs: Connect entities that appear across different document types and languages
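
One way to approach the cross-domain use case is an inverted index from entities to the documents that mention them. The sketch below is a minimal illustration, not the platform's actual implementation: the helper names, the document ids, and the case-folding normalization are all assumptions, and the input dicts only mimic the `entity_group` and `word` fields returned by the pipeline.

```python
from collections import defaultdict

def build_entity_index(doc_entities):
    """Map (label, normalized name) -> set of document ids mentioning it."""
    index = defaultdict(set)
    for doc_id, entities in doc_entities.items():
        for ent in entities:
            # Naive normalization (assumption): trim and case-fold the surface form.
            key = (ent["entity_group"], ent["word"].strip().casefold())
            index[key].add(doc_id)
    return index

def cross_document_entities(index, min_docs=2):
    """Keep only entities that appear in at least `min_docs` documents."""
    return {key: docs for key, docs in index.items() if len(docs) >= min_docs}

# Hand-written stand-ins for NER output (hypothetical documents).
doc_entities = {
    "patent-1": [
        {"entity_group": "ORG", "word": "Samsung Electronics"},
        {"entity_group": "LOC", "word": "Seoul"},
    ],
    "news-1": [
        {"entity_group": "ORG", "word": "samsung electronics"},
    ],
}

index = build_entity_index(doc_entities)
shared = cross_document_entities(index)
print(shared)
```

Real collections would likely need stronger normalization (aliases, legal suffixes such as "Co., Ltd.") than the case-folding used here.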

Works with the Knowledge Platform Embedding Model

This model is designed to work alongside deepakint/knowledge-platform-embeddings — a SciNCL-based embedding model fine-tuned with contrastive learning on the same document corpus.

Together they form a pipeline:

This NER model extracts entities (the nodes of a knowledge graph)
The embedding model finds document connections (the edges of a knowledge graph)
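
The division of labor can be sketched in plain Python, with hand-written entities and vectors standing in for the output of the two models. Everything in this example is an assumption for illustration: the `build_graph` helper, the 0.8 similarity threshold, the document ids, and the two-dimensional vectors.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def build_graph(doc_entities, doc_vectors, min_similarity=0.8):
    """Combine NER output (nodes) with embedding similarity (edges)."""
    # Nodes: unique (label, surface form) pairs found by the NER model.
    nodes = {
        (ent["entity_group"], ent["word"])
        for entities in doc_entities.values()
        for ent in entities
    }
    # Edges: document pairs whose embedding vectors are similar enough.
    edges = []
    for a, b in combinations(sorted(doc_vectors), 2):
        sim = cosine(doc_vectors[a], doc_vectors[b])
        if sim >= min_similarity:
            edges.append((a, b, round(sim, 3)))
    return {"nodes": nodes, "edges": edges}

# Hypothetical stand-ins for real model output.
doc_entities = {
    "patent-1": [{"entity_group": "ORG", "word": "Samsung Electronics"}],
    "paper-1": [{"entity_group": "BIO", "word": "CRISPR-Cas9"}],
    "news-1": [{"entity_group": "EVE", "word": "COP28"}],
}
doc_vectors = {
    "patent-1": [1.0, 0.0],
    "paper-1": [0.9, 0.1],
    "news-1": [0.0, 1.0],
}

graph = build_graph(doc_entities, doc_vectors)
print(graph["edges"])
```

With these toy vectors only the patent and the paper are close enough to be linked.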

Quick Start

from transformers import pipeline

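ner = pipeline(
    "ner",
    model="deepakint/knowledge-platform-ner",
    # Merge sub-word tokens into whole entity spans; each word takes the
    # label of its highest-scoring token.
    aggregation_strategy="max"
)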

# English patent text
text = "Samsung Electronics Co., Ltd. filed a patent at the USPTO in Washington, D.C."
entities = ner(text)

for entity in entities:
    print(f"  {entity['word']:40s} {entity['entity_group']:10s} {entity['score']:.3f}")

# Output:
#   Samsung Electronics Co., Ltd.            ORG        1.000
#   USPTO                                    ORG        0.998
#   Washington, D.C.                         LOC        0.999

# German political text
text = "Lisa Paus sprach im Deutschen Bundestag in Berlin über die neue Regulierung."
entities = ner(text)

for entity in entities:
    print(f"  {entity['word']:40s} {entity['entity_group']:10s} {entity['score']:.3f}")

# Output:
#   Lisa Paus                                PER        1.000
#   Deutschen Bundestag                      ORG        1.000
#   Berlin                                   LOC        1.000

Grouping Entities by Type

from collections import defaultdict

text = """Apple Inc. CEO Tim Cook announced a new research lab in Palo Alto, 
California, partnering with Stanford University on CRISPR gene editing research."""

entities = ner(text)
grouped = defaultdict(list)
for ent in entities:
    grouped[ent["entity_group"]].append(ent["word"])

for label, names in sorted(grouped.items()):
    print(f"  {label:8s}: {names}")

# Output:
#   BIO     : ['CRISPR']
#   LOC     : ['Palo Alto', 'California']
#   ORG     : ['Apple Inc.', 'Stanford University']
#   PER     : ['Tim Cook']

Training Details

Base Model

answerdotai/ModernBERT-base — a 149M parameter encoder model with:

8,192 token context length (vs. 512 for classic BERT)
Rotary Position Embeddings (RoPE)
Alternating full + sliding window attention
Pre-trained on 2 trillion tokens of primarily English text and code

Training Data

~256,000 documents from 13 data sources across multiple domains and languages:

Domain	Sources	Language
Patents	USPTO, EPO	EN, DE
Scientific Papers	OpenAlex, arXiv	EN
Political Documents	Bundestag, EU Parliament	DE, EN
News	Various	EN, DE

Hyperparameters

Parameter	Value
Learning rate	2e-05
Batch size	16 (x2 gradient accumulation = 32 effective)
Epochs	3
Optimizer	AdamW
LR scheduler	Cosine with 10% warmup
Seed	42
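
To make the scheduler row concrete, the sketch below computes the learning rate at a given optimizer step for a cosine schedule with 10% warmup. It assumes linear warmup from zero and a single half-cosine decay to zero, which is a common form of this schedule; the exact variant used in training is not stated in this card. The `total_steps` value is hypothetical.

```python
import math

def lr_at_step(step, total_steps, base_lr=2e-5, warmup_ratio=0.10):
    """Learning rate under linear warmup followed by cosine decay to zero."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear ramp from 0 up to base_lr.
        return base_lr * step / max(1, warmup_steps)
    # Fraction of the post-warmup phase that has elapsed, in [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

# Effective batch size from the table: 16 per step x 2 accumulation steps.
effective_batch = 16 * 2

total_steps = 1000  # hypothetical; depends on dataset size and epochs
for step in (0, 50, 100, 550, 1000):
    print(step, f"{lr_at_step(step, total_steps):.2e}")
```

The rate peaks at 2e-05 when warmup ends (step 100 of 1000 in this toy setup) and falls to roughly half of that midway through the decay phase.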

Training Progress

Epoch	Training Loss	Validation Loss	Precision	Recall	F1	Accuracy
1	0.1276	0.0766	0.8595	0.8361	0.8476	0.9728
2	0.0927	0.0623	0.8659	0.8923	0.8789	0.9777
3	0.0422	0.0694	0.8707	0.8949	0.8827	0.9778

Note: The best checkpoint (epoch ~2, lowest validation loss 0.0606) was selected as the final model, achieving 90.6% F1.
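
The selection rule in the note (pick the checkpoint with the lowest validation loss) can be sketched as follows, using only the end-of-epoch rows from the table. The 0.0606 loss quoted in the note does not appear among those rows, so this example illustrates the rule rather than reproducing the actual selection.

```python
# End-of-epoch evaluation rows copied from the Training Progress table.
evaluations = [
    {"epoch": 1, "eval_loss": 0.0766, "f1": 0.8476},
    {"epoch": 2, "eval_loss": 0.0623, "f1": 0.8789},
    {"epoch": 3, "eval_loss": 0.0694, "f1": 0.8827},
]

# Lowest validation loss wins, even if a later epoch has a higher F1.
best_by_loss = min(evaluations, key=lambda row: row["eval_loss"])
best_by_f1 = max(evaluations, key=lambda row: row["f1"])

print(best_by_loss["epoch"], best_by_f1["epoch"])  # prints 2 3
```

The two criteria disagree on these rows: validation loss rises again in epoch 3 while F1 still improves slightly, so the choice of selection metric matters.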

Strengths and Limitations

Strengths

Cross-domain: Works on patents, papers, news, and political documents with a single model
Multilingual: Handles both English and German text
Rich entity types: 15 entity types covering people, organizations, locations, biological entities, diseases, instruments, and more
Fast: ~5ms per document on CPU — suitable for processing millions of documents
Long context: Inherits ModernBERT's 8,192 token context window

Limitations

Conference/product names: May fragment uncommon compound names (e.g., "NeurIPS" split into several tokens); filter out low-confidence predictions with a score threshold (e.g., keep only score > 0.5)
Languages: Optimized for English and German; other languages may work but are untested
Domain drift: Performance is best on patent, scientific, political, and news text — may degrade on informal text (social media, chat)
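
The confidence thresholding suggested for fragmented names can be applied directly to the list of dicts returned by the pipeline. The sketch below is a minimal example: the helper name `filter_entities` is an assumption, and the sample entities and their scores are invented to show the effect, not real model output.

```python
def filter_entities(entities, min_score=0.5):
    """Keep only predictions whose confidence score exceeds `min_score`."""
    return [ent for ent in entities if ent["score"] > min_score]

# Invented sample in the shape of aggregated pipeline output.
sample = [
    {"entity_group": "ORG", "word": "Stanford University", "score": 0.998},
    {"entity_group": "EVE", "word": "Neur", "score": 0.41},
    {"entity_group": "EVE", "word": "IPS", "score": 0.37},
]

kept = filter_entities(sample)
print([ent["word"] for ent in kept])  # prints ['Stanford University']
```

A higher threshold trades recall for precision, so the cutoff is worth tuning on a sample of the target documents.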

Framework Versions

Transformers: 5.6.0
PyTorch: 2.5.1+cu121
Datasets: 4.8.4
Tokenizers: 0.22.2

Citation

@misc{knowledge-platform-ner-2026,
  title={Knowledge Platform NER: Cross-Domain Multilingual Named Entity Recognition},
  author={deepakint},
  year={2026},
  url={https://huggingface.co/deepakint/knowledge-platform-ner}
}

Related Models

Embedding Model: deepakint/knowledge-platform-embeddings — Cross-domain semantic search and document matching

