opendatalab (OpenDataLab)

**English**🌎|[简体中文](https://github.com/opendatalab/opendatalab-datasets/blob/main/introduction%20CN.md)🀄

> [!NOTE]
> 📚 In 2025, we open-sourced a high-quality multilingual dataset, **WanJuan 3.0 (WanJuan Silu)**.
>
> **🧾 January 2025: Initial Release of Multilingual Pre-training Corpus**
>
> Primarily text-based data, collected from publicly available web content, literature, patents, and more across 5 countries/regions. Total data size exceeds 1.2TB, with 300 billion tokens, a scale the team describes as internationally leading. The initial release includes Thai, Russian, Arabic, Korean, and Vietnamese sub-corpora, each exceeding 150GB. Using the "InternLM" Intelligent Tagging System, the research team categorized each sub-corpus into 7 major classes and 32 sub-classes, covering topics such as history, politics, culture, real estate, shopping, weather, dining, encyclopedias, and professional knowledge, to ensure local linguistic and cultural relevance. The corpus is designed so that researchers can easily retrieve data for diverse needs.
>
> Download Links: [Russian](https://opendatalab.com/OpenDataLab/WanJuan-Russian) • [Arabic](https://opendatalab.com/OpenDataLab/WanJuan-Arabic) • [Korean](https://opendatalab.com/OpenDataLab/WanJuan-Korean) • [Vietnamese](https://opendatalab.com/OpenDataLab/WanJuan-Vietnamese) • [Thai](https://opendatalab.com/OpenDataLab/WanJuan-Thai).
>
> **🌏 March 2025: Second Release of Multilingual Multimodal Corpus**
>
> The corpus comprises over 1.2TB of local text corpora from five countries. Each subset includes seven major categories and 34 subcategories, covering a wide range of local characteristics, such as history, politics, culture, real estate, shopping, weather, dining, encyclopedic knowledge, and professional expertise. Download links for the subsets are listed below, and everyone is welcome to download and use them.
>
> The release comprises 4 data types:
> - Image-Text: Over 2 million images (raw size: 362.174GB).
> - Audio-Text: 200 hours of ultra-high-precision annotated audio per language.
> - Video-Text: Over 8 million video clips (raw duration: 28,000+ hours; refined to 16,000+ hours of high-quality content).
> - Localized SFT (Supervised Fine-Tuning): 184,000 SFT entries covering local culture, daily conversations, code, mathematics, and science, with 23,000 entries per language. Each language's set includes 3,000 culturally unique Q&A pairs designed by local residents and 20,000 translated entries filtered through a quality-check pipeline combining rules and model scoring.
>
> In total, the release covers 8 languages across 4 modalities, with 11.5 million entries refined to industrial-grade quality for "ready-to-use" applications.
>
> Download Links: [5 languages (Arabic, Russian, Korean, Vietnamese, Thai)](https://opendatalab.com/OpenDataLab/WanJuanSiLu2O) • [3 languages (Serbian, Hungarian, Czech)](https://opendatalab.com/OpenDataLab/WanJuanSiLu2).

---

**🔥🔥🔥OpenDataLab builds an ecosystem of high-quality datasets for the community.** It provides:

# 🌟Extensive open data resources for AI models
● A fast and simple way to access open datasets  
● 7700+ large-scale, high-quality open datasets for large models  
● 1200+ open datasets for computer vision  
● 200+ open datasets from CVPR  
● Categorized datasets for hot topics  

# ✨Open-source data processing toolkits
● Data acquisition toolkits supporting large datasets  
● Data acquisition toolkits supporting various kinds of tasks  
● Open-source intelligent toolbox for labeling  

# 💫Dataset description language
● Format standardization  
● DSDL: Dataset Description Language  
● Define a CV dataset with DSDL  
● 100+ CV datasets standardized by OpenDataLab  

Check out our [tutorial videos](https://www.youtube.com/watch?v=LjbRt7uddyw) (in Chinese) to get started.

---

📣 We have launched an upgraded feature that lets authors upload datasets on their own. We invite you to use it to promote your open-source datasets, AI research results, and more, so that more people can access, obtain, and use your datasets.

See the [help doc](https://github.com/opendatalab/opendatalab-datasets/blob/main/help%20doc.md) for an introduction to the self-service dataset upload feature. You can create and share your dataset by following our guidelines. 💪

If you have any questions or run into problems, please feel free to contact us at OpenDataLab@pjlab.org.cn.

[![](https://github.com/opendatalab/opendatalab-datasets/blob/main/%E9%A1%B6%E4%BC%9A%E9%A1%B6%E5%88%8A%E6%95%B0%E6%8D%AE%E9%9B%86/ECCV/img/create%20your%20dataset.png?raw=true)](https://opendatalab.com/create?source=R2l0aHVi)
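
The Localized SFT figures quoted in the March 2025 release note can be cross-checked with a few lines of arithmetic. The sketch below is only an illustration: the function and constant names are hypothetical and not part of any OpenDataLab tool, and the numbers are copied from the note itself.

```python
# Figures quoted in the March 2025 release note (names here are illustrative).
LANGUAGES = 8
LOCAL_QA_PER_LANGUAGE = 3_000      # culturally unique Q&A pairs per language
TRANSLATED_PER_LANGUAGE = 20_000   # translated, quality-filtered entries per language
STATED_PER_LANGUAGE = 23_000
STATED_TOTAL = 184_000


def sft_totals(languages: int, local_qa: int, translated: int) -> tuple[int, int]:
    """Return (entries per language, entries across all languages)."""
    per_language = local_qa + translated
    return per_language, per_language * languages


per_language, total = sft_totals(
    LANGUAGES, LOCAL_QA_PER_LANGUAGE, TRANSLATED_PER_LANGUAGE
)
# Compare the derived values with the figures stated in the note.
print(per_language == STATED_PER_LANGUAGE, total == STATED_TOTAL)
```

The per-language count and the overall SFT total are consistent with each other: 3,000 + 20,000 = 23,000 entries per language, and 23,000 × 8 languages = 184,000 entries.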
026-02-03T20:15:11.000Z","likes":542,"pinned":true,"private":false,"sdk":"gradio","repoType":"space","runtime":{"stage":"RUNNING","hardware":{"current":"l40sx1","requested":"l40sx1"},"storage":null,"gcTimeout":3600,"replicas":{"current":1,"requested":1},"devMode":false,"domains":[{"domain":"opendatalab-mineru.hf.space","stage":"READY"}],"sha":"6850e805ffa23a043453ebffa2ba0872fdd92208"},"shortDescription":"A data extraction tool to convert PDF to Markdown and JSON","title":"MinerU OCR","isLikedByUser":false,"ai_short_description":"Convert PDFs to clean text with LaTeX support","ai_category":"Document Analysis","trendingScore":1,"tags":["gradio","region:us"],"featured":false},"repoId":"opendatalab/MinerU","repoType":"space","org":"opendatalab"},{"time":"2026-02-02T17:01:29.403Z","user":"apeters","userAvatarUrl":"/avatars/73ac7740e462ba0b53a2f2480d9f1e3e.svg","type":"paper","paper":{"id":"2601.07296","title":"LRAS: Advanced Legal Reasoning with Agentic Search","publishedAt":"2026-01-12T08:07:35.000Z","upvotes":3,"isUpvotedByUser":true}}],"acceptLanguages":["*"],"canReadRepos":false,"canReadSpaces":false,"blogPosts":[],"currentRepoPage":0,"filters":{},"paperView":false}">

AI & ML interests

OpenDataLab provides high-quality open datasets and tools for large models. It is the designated open-source data service platform of the China Large Model Corpus Data Alliance.

English🌎|简体中文🀄

📚 In 2025, we open-sourced a high-quality multilingual dataset, WanJuan 3.0 (WanJuan Silu).

🧾 January 2025: Initial Release of Multilingual Pre-training Corpus: Primarily text-based data, collected from publicly available web content, literature, patents, and more from 5 countries/regions. The total data size exceeds 1.2TB, with 300 billion tokens, which the team describes as internationally leading. The initial release includes Thai, Russian, Arabic, Korean, and Vietnamese sub-corpora, each exceeding 150GB. Using the "InternLM" Intelligent Tagging System, the research team categorized each sub-corpus into 7 major classes and 32 sub-classes, covering topics such as history, politics, culture, real estate, shopping, weather, dining, encyclopedias, and professional knowledge, to ensure localized linguistic and cultural relevance. The classification is designed to let researchers easily retrieve data for diverse needs.
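Download Links: Russian (https://opendatalab.com/OpenDataLab/WanJuan-Russian) • Arabic (https://opendatalab.com/OpenDataLab/WanJuan-Arabic) • Korean (https://opendatalab.com/OpenDataLab/WanJuan-Korean) • Vietnamese (https://opendatalab.com/OpenDataLab/WanJuan-Vietnamese) • Thai (https://opendatalab.com/OpenDataLab/WanJuan-Thai).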


🌏 March 2025: Second Release of Multilingual Multimodal Corpus: The text corpus comprises over 1.2TB of native-language text from five countries. Each subset includes seven major categories and 34 subcategories, covering a wide range of local topics such as history, politics, culture, real estate, shopping, weather, dining, encyclopedic knowledge, and professional expertise. Download links are provided for each subset, and everyone is welcome to download and use them.

The multimodal release comprises 4 data types:

  • Image-Text: Over 2 million images (raw size: 362.174GB).
  • Audio-Text: 200 hours of ultra-high-precision annotated audio per language.
  • Video-Text: Over 8 million video clips (raw duration: 28,000+ hours; refined to 16,000+ hours of high-quality content).
  • Localized SFT (Supervised Fine-Tuning): 184,000 SFT entries covering local culture, daily conversations, code, mathematics, and science. There are 23,000 entries per language, including 3,000 culturally unique Q&A pairs designed by local residents and 20,000 translated entries filtered through a quality-check pipeline combining rules and model scoring. Overall, the release covers 8 languages across 4 modalities, totaling 11.5 million entries, refined to industrial-grade quality for "ready-to-use" applications.
    Download Links: 5 languages (Arabic, Russian, Korean, Vietnamese, Thai) • 3 languages (Serbian, Hungarian, Czech).
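
The SFT figures above are internally consistent, which a few lines of Python can confirm; the same snippet sketches what "a quality-check pipeline combining rules and model scoring" could look like. This is an illustrative sketch only: the `passes_rules` and `keep_entry` helpers, the length bounds, and the 0.8 score threshold are hypothetical and do not describe OpenDataLab's actual pipeline.

```python
# Figures taken from the release notes above.
LANGUAGES = 8
LOCAL_QA_PER_LANGUAGE = 3_000      # Q&A pairs designed by local residents
TRANSLATED_PER_LANGUAGE = 20_000   # translated entries kept after filtering

entries_per_language = LOCAL_QA_PER_LANGUAGE + TRANSLATED_PER_LANGUAGE
total_sft_entries = entries_per_language * LANGUAGES


def passes_rules(text: str, min_chars: int = 10, max_chars: int = 4_000) -> bool:
    """Hypothetical rule check: stripped text must fall within a length range."""
    stripped = text.strip()
    return min_chars <= len(stripped) <= max_chars


def keep_entry(text: str, model_score: float, threshold: float = 0.8) -> bool:
    """Keep an entry only if it passes the rules AND its model score is high enough.

    `model_score` stands in for a quality score in [0, 1] from some scoring
    model; both the scale and the threshold are assumptions for illustration.
    """
    return passes_rules(text) and model_score >= threshold


if __name__ == "__main__":
    print(entries_per_language)  # 23000
    print(total_sft_entries)     # 184000
```

Requiring both checks to pass means the cheap rule check can reject obviously bad entries before a model score is ever consulted.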

🔥🔥🔥 OpenDataLab provides an ecosystem of high-quality datasets for the community. It offers:

🌟 Extensive open data resources for AI models

● Fast, simple access to open datasets
● 7,700+ large-scale, high-quality open datasets for large models
● 1,200+ open datasets for computer vision
● 200+ open datasets from CVPR
● Categorized datasets for hot topics

✨ Open-source data processing toolkits

● Data acquisition toolkits supporting large datasets
● Data acquisition toolkits supporting many kinds of tasks
● Open-source intelligent labeling toolbox

💫 Dataset description language

● Format standardization
● DSDL: Dataset Description Language
● Define a CV dataset with DSDL
● 100+ CV datasets standardized by OpenDataLab

Check out our tutorial videos (in Chinese) to get started.


📣 We have upgraded the platform and launched a feature that lets authors upload datasets themselves. We invite you to use it to promote your open-source datasets, AI research results, and more, so that more people can access, obtain, and use your datasets.

For an introduction to the self-service dataset upload feature, see the help doc. You can create and share your dataset by following our guidelines. 💪

If you have any questions or run into obstacles, please feel free to contact us at OpenDataLab@pjlab.org.cn.