AI & ML interests
Spanish Corpus
🔥 Open-source corpus ES: https://somosnlp.org/recursos/datasets 🔥
📚 NLP Corpus in Spanish
We want AI models that understand and speak Spanish the way the world's 600M Spanish speakers do. Will you help us?
We are collecting datasets from different countries, registers, and domains. The more varieties of the language, the better! Audio and image datasets are also welcome, as are datasets in languages close to Spanish (e.g., Catalan, Quechua).
You can contribute by providing a link to an existing dataset, translating one from English, or creating one yourself. All help is welcome! 🚀
➡️ Read the contribution guide, pick an issue, and go for it!
If you have any questions, you can reach us on Discord.
Thank you for supporting our mission to democratize NLP in Spanish!