AI & ML interests

Web as a corpus, Large Language Models, Machine Translation, Language Technologies, Natural Language Processing, Internet Archive, CommonCrawl

Recent Activity

ltgoslo  updated a model 3 days ago
HPLT/FinOLMo-13B
ltgoslo  updated a model 3 days ago
HPLT/NorOLMo-13B

BramVanroy 
posted an update 4 months ago
What are currently the best multilingual models with at most 72B parameters? Are Llama 3.3 70B and Qwen 2.5 72B still king?
BramVanroy 
posted an update 5 months ago
By popular request, I've just added two subsets to the CommonCrawl-Creative Commons Corpus (C5; BramVanroy/CommonCrawl-CreativeCommons) so that you do not have to do the filtering manually:

- C5f (BramVanroy/CommonCrawl-CreativeCommons-fine): only retains high-quality samples that are also present in FineWeb or FineWeb-2;
- C5r (https://huggingface.co/datasets/BramVanroy/CommonCrawl-CreativeCommons-recommended): additional strict filtering that removes samples with license disagreement, non-commercial licenses, and Wikipedia samples. Wikipedia is dropped because you should probably get it from a more reliable source that provides better parsed content.

It goes without saying that these filters lead to a massive reduction in quantity. Doc and token counts are given on the dataset pages.
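
To make the C5r criteria concrete, here is a minimal sketch of the three filters applied to a handful of toy records. The field names (`license_disagreement`, `non_commercial`, `url`) and the sample data are assumptions made for illustration only; they are not the actual dataset schema, so check the dataset page before adapting this.

```python
from typing import Iterable, List


def keep_for_recommended(sample: dict) -> bool:
    """Return True if a sample passes the three C5r-style filters.

    All field names here are hypothetical placeholders, not the real schema.
    """
    # 1. Drop samples whose detected licenses disagree with each other.
    if sample.get("license_disagreement", False):
        return False
    # 2. Drop samples under a non-commercial license (e.g. CC BY-NC).
    if sample.get("non_commercial", False):
        return False
    # 3. Drop Wikipedia pages; a dedicated dump is usually parsed better.
    if "wikipedia.org" in sample.get("url", ""):
        return False
    return True


def filter_recommended(samples: Iterable[dict]) -> List[dict]:
    """Keep only the samples that pass every filter."""
    return [s for s in samples if keep_for_recommended(s)]


# Toy records, invented for this example.
samples = [
    {"url": "https://example.org/post", "license_disagreement": False, "non_commercial": False},
    {"url": "https://example.org/mixed", "license_disagreement": True, "non_commercial": False},
    {"url": "https://example.org/nc", "license_disagreement": False, "non_commercial": True},
    {"url": "https://en.wikipedia.org/wiki/Oslo", "license_disagreement": False, "non_commercial": False},
]

kept = filter_recommended(samples)
print(len(kept), "of", len(samples), "samples kept")
```

Even on this toy input only one record in four survives, which is the same effect in miniature as the reduction described above.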
davanstrien 
posted an update 8 months ago
Inspired by Hugging Face's official MCP server, I've developed a complementary tool that exposes my semantic search API to enhance discovery across the HF platform.

Key capabilities:

- AI-powered semantic search for models and datasets
- Parameter count analysis via safetensors metadata
- Trending content discovery
- Find similar models/datasets functionality
- 11 tools total for enhanced ecosystem navigation
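
As a rough illustration of how a parameter count can be read from safetensors metadata without loading any weights: a safetensors file starts with an 8-byte little-endian header length, followed by a JSON header that lists each tensor's dtype and shape. The sketch below is not necessarily how the tool does it; the function names and the in-memory demo header are made up for this example.

```python
import json
import struct
from math import prod


def count_parameters(file_start: bytes) -> int:
    """Count parameters from the leading bytes of a safetensors file.

    Only the header is needed; tensor data is never read.
    """
    # First 8 bytes: unsigned 64-bit little-endian length of the JSON header.
    (header_len,) = struct.unpack("<Q", file_start[:8])
    header = json.loads(file_start[8 : 8 + header_len])

    total = 0
    for name, info in header.items():
        # "__metadata__" holds free-form string metadata, not a tensor.
        if name == "__metadata__":
            continue
        # Number of elements is the product of the dimensions (1 for a scalar).
        total += prod(info["shape"])
    return total


def make_demo_header(tensors: dict) -> bytes:
    """Build a header-only byte string for the demo (hypothetical helper)."""
    payload = json.dumps(tensors).encode("utf-8")
    return struct.pack("<Q", len(payload)) + payload


demo = make_demo_header(
    {
        "__metadata__": {"format": "pt"},
        "embed.weight": {"dtype": "F16", "shape": [10, 4], "data_offsets": [0, 80]},
        "bias": {"dtype": "F16", "shape": [4], "data_offsets": [80, 88]},
    }
)
print(count_parameters(demo))  # 10*4 + 4 = 44
```

Because only the header is parsed, the same idea works on a small ranged download of a remote file rather than the full checkpoint.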

The semantic search goes beyond simple keyword matching, understanding context and relationships between different models and datasets.

Example query: "Find around 10 reasoning Hugging Face datasets published in 2025 focusing on topics other than maths and science. Show a link and a short summary for each dataset." (results in video!)

https://github.com/davanstrien/hub-semantic-search-mcp