Science Team releasing large-scale pre-training datasets to accelerate open LLM development.
- FineWeb: a 15T-token English dataset for LLM pre-training. See the blog post and paper.
- FineWeb-Edu: a filtered subset of the most educational content from FineWeb.
- FineWeb2: an extension of FineWeb to over 1,000 languages. See the paper.
- FinePDFs: 3T tokens of text extracted from PDFs sourced from the web. See the blog post.
- FineWiki: an updated, better-extracted version of Wikipedia in 300+ languages.
- FinePDFs-Edu: 350B+ highly educational tokens filtered from FinePDFs.
- FineTranslations: 1T+ tokens of parallel text translated from 500+ FineWeb2 languages.
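All of these datasets are hosted on the Hugging Face Hub, so they can be streamed with the `datasets` library without downloading terabytes up front. Below is a minimal sketch that streams a few documents from FineWeb; the `sample-10BT` config name is taken from the dataset card and is an assumption here, as sample configs may change.

```python
# Minimal sketch: stream a FineWeb sample with the `datasets` library.
# The "sample-10BT" config name is an assumption based on the dataset card.
from datasets import load_dataset

fw = load_dataset(
    "HuggingFaceFW/fineweb",  # full dataset is ~15T tokens; use a sample config
    name="sample-10BT",       # assumed config for the 10B-token sample
    split="train",
    streaming=True,           # iterate lazily instead of downloading everything
)

# Peek at the first few documents; each record carries the extracted web text.
for doc in fw.take(3):
    print(doc["text"][:200])
```

The same pattern applies to the other datasets in the list by swapping in their repo IDs (e.g. `HuggingFaceFW/fineweb-edu` or `HuggingFaceFW/fineweb-2`) and the config names listed on their respective dataset cards.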
AI & ML interests
We release large pre-training datasets to accelerate open LLM development. Part of the Hugging Face Science team (hf.co/science).
Papers
FineWeb2: One Pipeline to Scale Them All -- Adapting Pre-Training Data Processing to Every Language
The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale