datasets

zzfive 's Collections

world_model

VLA

RolePlaying

dLLM

industry

RAG

ssm

safety

inference optimization

updated Feb 12

Upvote

MS MARCO Web Search: a Large-scale Information-rich Web Dataset with Millions of Real Click Labels

Paper • 2405.07526 • Published May 13, 2024 • 21
Automatic Data Curation for Self-Supervised Learning: A Clustering-Based Approach

Paper • 2405.15613 • Published May 24, 2024 • 17
A Touch, Vision, and Language Dataset for Multimodal Alignment

Paper • 2402.13232 • Published Feb 20, 2024 • 16
How Do Large Language Models Acquire Factual Knowledge During Pretraining?

Paper • 2406.11813 • Published Jun 17, 2024 • 31
DataComp-LM: In search of the next generation of training sets for language models

Paper • 2406.11794 • Published Jun 17, 2024 • 55
MMDU: A Multi-Turn Multi-Image Dialog Understanding Benchmark and Instruction-Tuning Dataset for LVLMs

Paper • 2406.11833 • Published Jun 17, 2024 • 62
From Pixels to Prose: A Large Dataset of Dense Image Captions

Paper • 2406.10328 • Published Jun 14, 2024 • 18
MINT-1T: Scaling Open-Source Multimodal Data by 10x: A Multimodal Dataset with One Trillion Tokens

Paper • 2406.11271 • Published Jun 17, 2024 • 21
StableSemantics: A Synthetic Language-Vision Dataset of Semantic Representations in Naturalistic Images

Paper • 2406.13735 • Published Jun 19, 2024 • 6
Stylebreeder: Exploring and Democratizing Artistic Styles through Text-to-Image Models

Paper • 2406.14599 • Published Jun 20, 2024 • 17
Scaling Synthetic Data Creation with 1,000,000,000 Personas

Paper • 2406.20094 • Published Jun 28, 2024 • 107
Arboretum: A Large Multimodal Dataset Enabling AI for Biodiversity

Paper • 2406.17720 • Published Jun 25, 2024 • 8
OpenVid-1M: A Large-Scale High-Quality Dataset for Text-to-video Generation

Paper • 2407.02371 • Published Jul 2, 2024 • 55
TabReD: A Benchmark of Tabular Machine Learning in-the-Wild

Paper • 2406.19380 • Published Jun 27, 2024 • 49
Stark: Social Long-Term Multi-Modal Conversation with Persona Commonsense Knowledge

Paper • 2407.03958 • Published Jul 4, 2024 • 21
MiraData: A Large-Scale Video Dataset with Long Durations and Structured Captions

Paper • 2407.06358 • Published Jul 8, 2024 • 19
Ref-AVS: Refer and Segment Objects in Audio-Visual Scenes

Paper • 2407.10957 • Published Jul 15, 2024 • 24
YouTube-SL-25: A Large-Scale, Open-Domain Multilingual Sign Language Parallel Corpus

Paper • 2407.11144 • Published Jul 15, 2024 • 10
Visual Text Generation in the Wild

Paper • 2407.14138 • Published Jul 19, 2024 • 9
VolDoGer: LLM-assisted Datasets for Domain Generalization in Vision-Language Tasks

Paper • 2407.19795 • Published Jul 29, 2024 • 11
Sentence-wise Speech Summarization: Task, Datasets, and End-to-End Modeling with LM Knowledge Distillation

Paper • 2408.00205 • Published Aug 1, 2024 • 5
VidGen-1M: A Large-Scale Dataset for Text-to-video Generation

Paper • 2408.02629 • Published Aug 5, 2024 • 15
MedTrinity-25M: A Large-scale Multimodal Dataset with Multigranular Annotations for Medicine

Paper • 2408.02900 • Published Aug 6, 2024 • 31
Diffusion Models as Data Mining Tools

Paper • 2408.02752 • Published Jul 20, 2024 • 15
Speech-MASSIVE: A Multilingual Speech Dataset for SLU and Beyond

Paper • 2408.03900 • Published Aug 7, 2024 • 10
Img-Diff: Contrastive Data Synthesis for Multimodal Large Language Models

Paper • 2408.04594 • Published Aug 8, 2024 • 14
VGGHeads: A Large-Scale Synthetic Dataset for 3D Human Heads

Paper • 2407.18245 • Published Jul 25, 2024 • 12
MovieSum: An Abstractive Summarization Dataset for Movie Screenplays

Paper • 2408.06281 • Published Aug 12, 2024 • 9
InfinityMATH: A Scalable Instruction Tuning Dataset in Programmatic Mathematical Reasoning

Paper • 2408.07089 • Published Aug 9, 2024 • 14
DeepSpeak Dataset v1.0

Paper • 2408.05366 • Published Aug 9, 2024 • 14
D5RL: Diverse Datasets for Data-Driven Deep Reinforcement Learning

Paper • 2408.08441 • Published Aug 15, 2024 • 8
VLM4Bio: A Benchmark Dataset to Evaluate Pretrained Vision-Language Models for Trait Discovery from Biological Images

Paper • 2408.16176 • Published Aug 28, 2024 • 8
ClimDetect: A Benchmark Dataset for Climate Change Detection and Attribution

Paper • 2408.15993 • Published Aug 28, 2024 • 8
Kvasir-VQA: A Text-Image Pair GI Tract Dataset

Paper • 2409.01437 • Published Sep 2, 2024 • 71
The MERIT Dataset: Modelling and Efficiently Rendering Interpretable Transcripts

Paper • 2409.00447 • Published Aug 31, 2024 • 3
HumanVid: Demystifying Training Data for Camera-controllable Human Image Animation

Paper • 2407.17438 • Published Jul 24, 2024 • 26
TIP-I2V: A Million-Scale Real Text and Image Prompt Dataset for Image-to-Video Generation

Paper • 2411.04709 • Published Nov 5, 2024 • 27
Improving the detection of technical debt in Java source code with an enriched dataset

Paper • 2411.05457 • Published Nov 8, 2024 • 2
GitChameleon: Unmasking the Version-Switching Capabilities of Code Generation Models

Paper • 2411.05830 • Published Nov 5, 2024 • 21
BLIP3-KALE: Knowledge Augmented Large-Scale Dense Captions

Paper • 2411.07461 • Published Nov 12, 2024 • 23
EgoVid-5M: A Large-Scale Video-Action Dataset for Egocentric Video Generation

Paper • 2411.08380 • Published Nov 13, 2024 • 25
RedPajama: an Open Dataset for Training Large Language Models

Paper • 2411.12372 • Published Nov 19, 2024 • 58
VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning via Core Frame Selection

Paper • 2411.14794 • Published Nov 22, 2024 • 13
VISTA: Enhancing Long-Duration and High-Resolution Video Understanding by Video Spatiotemporal Augmentation

Paper • 2412.00927 • Published Dec 1, 2024 • 29
VisOnlyQA: Large Vision Language Models Still Struggle with Visual Perception of Geometric Information

Paper • 2412.00947 • Published Dec 1, 2024 • 8
Global MMLU: Understanding and Addressing Cultural and Linguistic Biases in Multilingual Evaluation

Paper • 2412.03304 • Published Dec 4, 2024 • 20
HumanEdit: A High-Quality Human-Rewarded Dataset for Instruction-based Image Editing

Paper • 2412.04280 • Published Dec 5, 2024 • 13
MAmmoTH-VL: Eliciting Multimodal Reasoning with Instruction Tuning at Scale

Paper • 2412.05237 • Published Dec 6, 2024 • 46
BigDocs: An Open and Permissively-Licensed Dataset for Training Multimodal Models on Document and Code Tasks

Paper • 2412.04626 • Published Dec 5, 2024 • 13
LAION-SG: An Enhanced Large-Scale Dataset for Training Complex Image-Text Models with Structural Annotations

Paper • 2412.08580 • Published Dec 11, 2024 • 45
MIT-10M: A Large Scale Parallel Corpus of Multilingual Image Translation

Paper • 2412.07147 • Published Dec 10, 2024 • 5
VisionArena: 230K Real World User-VLM Conversations with Preference Labels

Paper • 2412.08687 • Published Dec 11, 2024 • 13
OpenCSG Chinese Corpus: A Series of High-quality Chinese Datasets for LLM Training

Paper • 2501.08197 • Published Jan 14, 2025 • 9
The Heap: A Contamination-Free Multilingual Code Dataset for Evaluating Large Language Models

Paper • 2501.09653 • Published Jan 16, 2025 • 12
Emilia: A Large-Scale, Extensive, Multilingual, and Diverse Dataset for Speech Generation

Paper • 2501.15907 • Published Jan 27, 2025 • 18
OpenCharacter: Training Customizable Role-Playing LLMs with Large-Scale Synthetic Personas

Paper • 2501.15427 • Published Jan 26, 2025 • 6
WILDCHAT-50M: A Deep Dive Into the Role of Synthetic Data in Post-Training

Paper • 2501.18511 • Published Jan 30, 2025 • 20
COCONut-PanCap: Joint Panoptic Segmentation and Grounded Captions for Fine-Grained Understanding and Generation

Paper • 2502.02589 • Published Feb 4, 2025 • 9
Generating Multi-Image Synthetic Data for Text-to-Image Customization

Paper • 2502.01720 • Published Feb 3, 2025 • 8
Expect the Unexpected: FailSafe Long Context QA for Finance

Paper • 2502.06329 • Published Feb 10, 2025 • 133
TextAtlas5M: A Large-scale Dataset for Dense Text Image Generation

Paper • 2502.07870 • Published Feb 11, 2025 • 45
MM-RLHF: The Next Step Forward in Multimodal LLM Alignment

Paper • 2502.10391 • Published Feb 14, 2025 • 34
REALTALK: A 21-Day Real-World Dataset for Long-Term Conversation

Paper • 2502.13270 • Published Feb 18, 2025 • 6
Audio-FLAN: A Preliminary Release

Paper • 2502.16584 • Published Feb 23, 2025 • 36
VideoUFO: A Million-Scale User-Focused Dataset for Text-to-Video Generation

Paper • 2503.01739 • Published Mar 3, 2025 • 9
Qilin: A Multimodal Information Retrieval Dataset with APP-level User Sessions

Paper • 2503.00501 • Published Mar 1, 2025 • 12
KodCode: A Diverse, Challenging, and Verifiable Synthetic Dataset for Coding

Paper • 2503.02951 • Published Mar 4, 2025 • 33
Crowdsource, Crawl, or Generate? Creating SEA-VL, a Multicultural Vision-Language Dataset for Southeast Asia

Paper • 2503.07920 • Published Mar 10, 2025 • 101
CineBrain: A Large-Scale Multi-Modal Brain Dataset During Naturalistic Audiovisual Narrative Processing

Paper • 2503.06940 • Published Mar 10, 2025 • 11
DropletVideo: A Dataset and Approach to Explore Integral Spatio-Temporal Consistent Video Generation

Paper • 2503.06053 • Published Mar 8, 2025 • 138
ELTEX: A Framework for Domain-Driven Synthetic Data Generation

Paper • 2503.15055 • Published Mar 19, 2025 • 6
TeleAntiFraud-28k: A Audio-Text Slow-Thinking Dataset for Telecom Fraud Detection

Paper • 2503.24115 • Published Mar 31, 2025 • 11
LiveVQA: Live Visual Knowledge Seeking

Paper • 2504.05288 • Published Apr 7, 2025 • 15
COIG-P: A High-Quality and Large-Scale Chinese Preference Dataset for Alignment with Human Values

Paper • 2504.05535 • Published Apr 7, 2025 • 44
DeepMath-103K: A Large-Scale, Challenging, Decontaminated, and Verifiable Mathematical Dataset for Advancing Reasoning

Paper • 2504.11456 • Published Apr 15, 2025 • 12
SIFT-50M: A Large-Scale Multilingual Dataset for Speech Instruction Fine-Tuning

Paper • 2504.09081 • Published Apr 12, 2025 • 16
CLIMB: CLustering-based Iterative Data Mixture Bootstrapping for Language Model Pre-training

Paper • 2504.13161 • Published Apr 17, 2025 • 97
MIG: Automatic Data Selection for Instruction Tuning by Maximizing Information Gain in Semantic Space

Paper • 2504.13835 • Published Apr 18, 2025 • 38
LeetCodeDataset: A Temporal Dataset for Robust Evaluation and Efficient Training of Code LLMs

Paper • 2504.14655 • Published Apr 20, 2025 • 21
QuaDMix: Quality-Diversity Balanced Data Selection for Efficient LLM Pretraining

Paper • 2504.16511 • Published Apr 23, 2025 • 22
Dynamic Camera Poses and Where to Find Them

Paper • 2504.17788 • Published Apr 24, 2025 • 6
R&B: Domain Regrouping and Data Mixture Balancing for Efficient Foundation Model Training

Paper • 2505.00358 • Published May 1, 2025 • 26
PrismLayers: Open Data for High-Quality Multi-Layer Transparent Image Generative Models

Paper • 2505.22523 • Published May 28, 2025 • 7
Large Language Models for Data Synthesis

Paper • 2505.14752 • Published May 20, 2025 • 49
SynthRL: Scaling Visual Reasoning with Verifiable Data Synthesis

Paper • 2506.02096 • Published Jun 2, 2025 • 52
One Missing Piece for Open-Source Reasoning Models: A Dataset to Mitigate Cold-Starting Short CoT LLMs in RL

Paper • 2506.02338 • Published Jun 3, 2025 • 5
The Common Pile v0.1: An 8TB Dataset of Public Domain and Openly Licensed Text

Paper • 2506.05209 • Published Jun 5, 2025 • 61
Peer-Ranked Precision: Creating a Foundational Dataset for Fine-Tuning Vision Models from DataSeeds' Annotated Imagery

Paper • 2506.05673 • Published Jun 6, 2025 • 10
CCI4.0: A Bilingual Pretraining Dataset for Enhancing Reasoning in Large Language Models

Paper • 2506.07463 • Published Jun 9, 2025 • 11
Sekai: A Video Dataset towards World Exploration

Paper • 2506.15675 • Published Jun 18, 2025 • 66
Phantom-Data : Towards a General Subject-Consistent Video Generation Dataset

Paper • 2506.18851 • Published Jun 23, 2025 • 30
ShareGPT-4o-Image: Aligning Multimodal Models with GPT-4o-Level Image Generation

Paper • 2506.18095 • Published Jun 22, 2025 • 66
BMMR: A Large-Scale Bilingual Multimodal Multi-Discipline Reasoning Dataset

Paper • 2507.03483 • Published Jul 4, 2025 • 24
Easy Dataset: A Unified and Extensible Framework for Synthesizing LLM Fine-Tuning Data from Unstructured Documents

Paper • 2507.04009 • Published Jul 5, 2025 • 54
Go to Zero: Towards Zero-shot Motion Generation with Million-scale Data

Paper • 2507.07095 • Published Jul 9, 2025 • 56
SpeakerVid-5M: A Large-Scale High-Quality Dataset for Audio-Visual Dyadic Interactive Human Generation

Paper • 2507.09862 • Published Jul 14, 2025 • 51
A Data-Centric Framework for Addressing Phonetic and Prosodic Challenges in Russian Speech Generative Models

Paper • 2507.13563 • Published Jul 17, 2025 • 53
MegaScience: Pushing the Frontiers of Post-Training Datasets for Science Reasoning

Paper • 2507.16812 • Published Jul 22, 2025 • 64
Zebra-CoT: A Dataset for Interleaved Vision Language Reasoning

Paper • 2507.16746 • Published Jul 22, 2025 • 35
Multi-human Interactive Talking Dataset

Paper • 2508.03050 • Published Aug 5, 2025 • 10
VeriGUI: Verifiable Long-Chain GUI Dataset

Paper • 2508.04026 • Published Aug 6, 2025 • 164
FACTORY: A Challenging Human-Verified Prompt Set for Long-Form Factuality

Paper • 2508.00109 • Published Jul 31, 2025 • 4
Speech-to-LaTeX: New Models and Datasets for Converting Spoken Equations and Sentences

Paper • 2508.03542 • Published Aug 5, 2025 • 5
Echo-4o: Harnessing the Power of GPT-4o Synthetic Images for Improved Image Generation

Paper • 2508.09987 • Published Aug 13, 2025 • 25
A Survey of Scientific Large Language Models: From Data Foundations to Agent Frontiers

Paper • 2508.21148 • Published Aug 28, 2025 • 142
TalkVid: A Large-Scale Diversified Dataset for Audio-Driven Talking Head Synthesis

Paper • 2508.13618 • Published Aug 19, 2025 • 18
Inverse IFEval: Can LLMs Unlearn Stubborn Training Conventions to Follow Real Instructions?

Paper • 2509.04292 • Published Sep 4, 2025 • 58
Reverse-Engineered Reasoning for Open-Ended Generation

Paper • 2509.06160 • Published Sep 7, 2025 • 151
FLUX-Reason-6M & PRISM-Bench: A Million-Scale Text-to-Image Reasoning Dataset and Comprehensive Benchmark

Paper • 2509.09680 • Published Sep 11, 2025 • 44
SpatialVID: A Large-Scale Video Dataset with Spatial Annotations

Paper • 2509.09676 • Published Sep 11, 2025 • 35
OmniWorld: A Multi-Domain and Multi-Modal Dataset for 4D World Modeling

Paper • 2509.12201 • Published Sep 15, 2025 • 107
PersonaX: Multimodal Datasets with LLM-Inferred Behavior Traits

Paper • 2509.11362 • Published Sep 14, 2025 • 5
MultiEdit: Advancing Instruction-based Image Editing on Diverse and Challenging Tasks

Paper • 2509.14638 • Published Sep 18, 2025 • 14
Scaling Instruction-Based Video Editing with a High-Quality Synthetic Dataset

Paper • 2510.15742 • Published Oct 17, 2025 • 51
Pico-Banana-400K: A Large-Scale Dataset for Text-Guided Image Editing

Paper • 2510.19808 • Published Oct 22, 2025 • 30
FineVision: Open Data Is All You Need

Paper • 2510.17269 • Published Oct 20, 2025 • 80
Fine-T2I: An Open, Large-Scale, and Diverse Dataset for High-Quality T2I Fine-Tuning

Paper • 2602.09439 • Published Feb 10 • 13

Upvote

Collection guide
Browse collections