Using Storage Buckets as a Working Layer for Data Pipelines davanstrien • Mar 26 • 3
Train AI models with Unsloth and Hugging Face Jobs for FREE +4 burtenshaw, danielhanchen, shimmyshimmer, mlabonne, davanstrien, evalstate • Feb 20 • 100
Community Evals: Because we're done trusting black-box leaderboards over the community +5 burtenshaw, SaylorTwift, kramp, merve, davanstrien, nielsr, julien-c • Feb 4 • 89
Supercharge your OCR Pipelines with Open Models +5 merve, ariG23498, davanstrien, hynky, andito, reach-vb, pcuenq • Oct 21, 2025 • 309
🇵🇭 FilBench - Can LLMs Understand and Generate Filipino? +7 ljvmiranda921, acocodes, connermanuel, jcblaise, jcblaise, josephimperial, davanstrien, SaylorTwift, clefourrier • Aug 12, 2025 • 23
FineWeb-C: A Community-Driven Dataset for Educational Quality Annotations in 122 Languages davanstrien • Jul 8, 2025 • 35
Explore, Curate and Vector Search Any Hugging Face Dataset with Nomic Atlas MaxNomic • Jan 23, 2025 • 30
FineWeb2-C: Help Build Better Language Models in Your Language davanstrien • Dec 23, 2024 • 21
Open Preference Dataset for Text-to-Image Generation by the 🤗 Community +5 davidberenstein1957, burtenshaw, dvilasuero, davanstrien, sayakpaul, Ameeeee, linoyts • Dec 9, 2024 • 70
Let’s make a generation of amazing image generation models burtenshaw • Nov 26, 2024 • 33
Share your open ML datasets on Hugging Face Hub! +2 davanstrien, cfahlgren1, lhoestq, erinys • Nov 12, 2024 • 32
Scaling AI-based Data Processing with Hugging Face + Dask +2 scj13, jrbourbeau, lhoestq, davanstrien • Oct 9, 2024 • 33
Introducing Synthetic Data Workshop: Your Gateway to Easy Synthetic Dataset Creation davanstrien • Jun 20, 2024 • 12
Data Is Better Together: A Look Back and Forward +1 davanstrien, davidberenstein1957, sdiazlor • Jun 20, 2024 • 20
Synthetic dataset generation techniques: generating custom sentence similarity data davanstrien • May 23, 2024 • 16
Synthetic dataset generation techniques: Self-Instruct davanstrien • May 15, 2024 • 23
Can we create pedagogically valuable multi-turn synthetic datasets from Cosmopedia? davanstrien • May 7, 2024 • 8
Cosmopedia: how to create large-scale synthetic data for pre-training Large Language Models +1 loubnabnl, anton-l, davanstrien • Mar 20, 2024 • 113