DHPLT: large-scale multilingual diachronic corpora and word representations for semantic change modelling
Abstract
In this resource paper, we present DHPLT, an open collection of diachronic corpora in 41 diverse languages. DHPLT is based on the web-crawled HPLT datasets; we use web crawl timestamps as an approximate signal of document creation time. The collection covers three time periods: 2011-2015, 2020-2021 and 2024-present (1 million documents per time period for each language). We additionally provide pre-computed word type and token embeddings and lexical substitutions for our chosen target words, while leaving other researchers free to select their own target words from the same datasets. DHPLT aims to address the current lack of multilingual diachronic corpora for semantic change modelling (beyond a dozen high-resource languages). It opens the way for a variety of new experimental setups in this field. All the resources described in this paper are available at https://data.hplt-project.org/three/diachronic/, sorted by language.
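As an illustration of how crawl timestamps can serve as a proxy for document creation time, the sketch below assigns a timestamp to one of the three periods named in the abstract. It is a minimal example under stated assumptions, not the authors' pipeline: the function name `assign_period`, the ISO 8601 timestamp format, and the treatment of "2024-present" as having no upper bound are all hypothetical choices made for this example.

```python
from datetime import datetime
from typing import Optional

# Year ranges (inclusive) taken from the abstract.
# Assumption: "2024-present" is modelled as having no upper bound.
PERIODS = [
    ("2011-2015", 2011, 2015),
    ("2020-2021", 2020, 2021),
    ("2024-present", 2024, None),
]


def assign_period(crawl_timestamp: str) -> Optional[str]:
    """Map an ISO 8601 crawl timestamp to a period label.

    Returns None when the year falls in a gap between periods
    (for example 2016-2019), so such documents can be skipped.
    """
    year = datetime.fromisoformat(crawl_timestamp).year
    for label, start, end in PERIODS:
        if year >= start and (end is None or year <= end):
            return label
    return None


if __name__ == "__main__":
    # Hypothetical timestamps, for illustration only.
    for ts in ("2013-05-20T14:03:11", "2018-01-01T00:00:00", "2024-06-01T00:00:00"):
        print(ts, "->", assign_period(ts))
```

Returning `None` for years outside the three periods keeps the gaps between them explicit, which matters when comparing word usage across clearly separated time slices.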