datasets - a zzfive Collection
MS MARCO Web Search: a Large-scale Information-rich Web Dataset with
Millions of Real Click Labels Paper
• 2405.07526
• Published May 13, 2024 • 21
Automatic Data Curation for Self-Supervised Learning: A Clustering-Based
Approach Paper
• 2405.15613
• Published May 24, 2024 • 17
A Touch, Vision, and Language Dataset for Multimodal Alignment Paper
• 2402.13232
• Published Feb 20, 2024 • 16
How Do Large Language Models Acquire Factual Knowledge During
Pretraining? Paper
• 2406.11813
• Published Jun 17, 2024 • 31
DataComp-LM: In search of the next generation of training sets for
language models Paper
• 2406.11794
• Published Jun 17, 2024 • 55
MMDU: A Multi-Turn Multi-Image Dialog Understanding Benchmark and
Instruction-Tuning Dataset for LVLMs Paper
• 2406.11833
• Published Jun 17, 2024 • 62
From Pixels to Prose: A Large Dataset of Dense Image Captions Paper
• 2406.10328
• Published Jun 14, 2024 • 18
MINT-1T: Scaling Open-Source Multimodal Data by 10x: A Multimodal
Dataset with One Trillion Tokens Paper
• 2406.11271
• Published Jun 17, 2024 • 21
StableSemantics: A Synthetic Language-Vision Dataset of Semantic
Representations in Naturalistic Images Paper
• 2406.13735
• Published Jun 19, 2024 • 6
Stylebreeder: Exploring and Democratizing Artistic Styles through
Text-to-Image Models Paper
• 2406.14599
• Published Jun 20, 2024 • 17
Scaling Synthetic Data Creation with 1,000,000,000 Personas Paper
• 2406.20094
• Published Jun 28, 2024 • 107
Arboretum: A Large Multimodal Dataset Enabling AI for Biodiversity Paper
• 2406.17720
• Published Jun 25, 2024 • 8
OpenVid-1M: A Large-Scale High-Quality Dataset for Text-to-video
Generation Paper
• 2407.02371
• Published Jul 2, 2024 • 55
TabReD: A Benchmark of Tabular Machine Learning in-the-Wild Paper
• 2406.19380
• Published Jun 27, 2024 • 49
Stark: Social Long-Term Multi-Modal Conversation with Persona
Commonsense Knowledge Paper
• 2407.03958
• Published Jul 4, 2024 • 21
MiraData: A Large-Scale Video Dataset with Long Durations and Structured
Captions Paper
• 2407.06358
• Published Jul 8, 2024 • 19
Ref-AVS: Refer and Segment Objects in Audio-Visual Scenes Paper
• 2407.10957
• Published Jul 15, 2024 • 24
YouTube-SL-25: A Large-Scale, Open-Domain Multilingual Sign Language
Parallel Corpus Paper
• 2407.11144
• Published Jul 15, 2024 • 10
Visual Text Generation in the Wild Paper
• 2407.14138
• Published Jul 19, 2024 • 9
VolDoGer: LLM-assisted Datasets for Domain Generalization in
Vision-Language Tasks Paper
• 2407.19795
• Published Jul 29, 2024 • 11
Sentence-wise Speech Summarization: Task, Datasets, and End-to-End
Modeling with LM Knowledge Distillation Paper
• 2408.00205
• Published Aug 1, 2024 • 5
VidGen-1M: A Large-Scale Dataset for Text-to-video Generation Paper
• 2408.02629
• Published Aug 5, 2024 • 15
MedTrinity-25M: A Large-scale Multimodal Dataset with Multigranular
Annotations for Medicine Paper
• 2408.02900
• Published Aug 6, 2024 • 31
Diffusion Models as Data Mining Tools Paper
• 2408.02752
• Published Jul 20, 2024 • 15
Speech-MASSIVE: A Multilingual Speech Dataset for SLU and Beyond Paper
• 2408.03900
• Published Aug 7, 2024 • 10
Img-Diff: Contrastive Data Synthesis for Multimodal Large Language
Models Paper
• 2408.04594
• Published Aug 8, 2024 • 14
VGGHeads: A Large-Scale Synthetic Dataset for 3D Human Heads Paper
• 2407.18245
• Published Jul 25, 2024 • 12
MovieSum: An Abstractive Summarization Dataset for Movie Screenplays Paper
• 2408.06281
• Published Aug 12, 2024 • 9
InfinityMATH: A Scalable Instruction Tuning Dataset in Programmatic
Mathematical Reasoning Paper
• 2408.07089
• Published Aug 9, 2024 • 14
Paper
• 2408.05366
• Published Aug 9, 2024 • 14
D5RL: Diverse Datasets for Data-Driven Deep Reinforcement Learning Paper
• 2408.08441
• Published Aug 15, 2024 • 8
VLM4Bio: A Benchmark Dataset to Evaluate Pretrained Vision-Language
Models for Trait Discovery from Biological Images Paper
• 2408.16176
• Published Aug 28, 2024 • 8
ClimDetect: A Benchmark Dataset for Climate Change Detection and
Attribution Paper
• 2408.15993
• Published Aug 28, 2024 • 8
Kvasir-VQA: A Text-Image Pair GI Tract Dataset Paper
• 2409.01437
• Published Sep 2, 2024 • 71
The MERIT Dataset: Modelling and Efficiently Rendering Interpretable
Transcripts Paper
• 2409.00447
• Published Aug 31, 2024 • 3
HumanVid: Demystifying Training Data for Camera-controllable Human Image
Animation Paper
• 2407.17438
• Published Jul 24, 2024 • 26
TIP-I2V: A Million-Scale Real Text and Image Prompt Dataset for
Image-to-Video Generation Paper
• 2411.04709
• Published Nov 5, 2024 • 27
Improving the detection of technical debt in Java source code with an
enriched dataset Paper
• 2411.05457
• Published Nov 8, 2024 • 2
GitChameleon: Unmasking the Version-Switching Capabilities of Code
Generation Models Paper
• 2411.05830
• Published Nov 5, 2024 • 21
BLIP3-KALE: Knowledge Augmented Large-Scale Dense Captions Paper
• 2411.07461
• Published Nov 12, 2024 • 23
EgoVid-5M: A Large-Scale Video-Action Dataset for Egocentric Video
Generation Paper
• 2411.08380
• Published Nov 13, 2024 • 25
RedPajama: an Open Dataset for Training Large Language Models Paper
• 2411.12372
• Published Nov 19, 2024 • 58
VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained
Video Reasoning via Core Frame Selection Paper
• 2411.14794
• Published Nov 22, 2024 • 13
VISTA: Enhancing Long-Duration and High-Resolution Video Understanding
by Video Spatiotemporal Augmentation Paper
• 2412.00927
• Published Dec 1, 2024 • 29
VisOnlyQA: Large Vision Language Models Still Struggle with Visual
Perception of Geometric Information Paper
• 2412.00947
• Published Dec 1, 2024 • 8
Global MMLU: Understanding and Addressing Cultural and Linguistic Biases
in Multilingual Evaluation Paper
• 2412.03304
• Published Dec 4, 2024 • 20
HumanEdit: A High-Quality Human-Rewarded Dataset for Instruction-based
Image Editing Paper
• 2412.04280
• Published Dec 5, 2024 • 13
MAmmoTH-VL: Eliciting Multimodal Reasoning with Instruction Tuning at
Scale Paper
• 2412.05237
• Published Dec 6, 2024 • 46
BigDocs: An Open and Permissively-Licensed Dataset for Training
Multimodal Models on Document and Code Tasks Paper
• 2412.04626
• Published Dec 5, 2024 • 13
LAION-SG: An Enhanced Large-Scale Dataset for Training Complex
Image-Text Models with Structural Annotations Paper
• 2412.08580
• Published Dec 11, 2024 • 45
MIT-10M: A Large Scale Parallel Corpus of Multilingual Image Translation Paper
• 2412.07147
• Published Dec 10, 2024 • 5
VisionArena: 230K Real World User-VLM Conversations with Preference
Labels Paper
• 2412.08687
• Published Dec 11, 2024 • 13
OpenCSG Chinese Corpus: A Series of High-quality Chinese Datasets for
LLM Training Paper
• 2501.08197
• Published Jan 14, 2025 • 9
The Heap: A Contamination-Free Multilingual Code Dataset for Evaluating
Large Language Models Paper
• 2501.09653
• Published Jan 16, 2025 • 12
Emilia: A Large-Scale, Extensive, Multilingual, and Diverse Dataset for
Speech Generation Paper
• 2501.15907
• Published Jan 27, 2025 • 18
OpenCharacter: Training Customizable Role-Playing LLMs with Large-Scale
Synthetic Personas Paper
• 2501.15427
• Published Jan 26, 2025 • 6
WILDCHAT-50M: A Deep Dive Into the Role of Synthetic Data in
Post-Training Paper
• 2501.18511
• Published Jan 30, 2025 • 20
COCONut-PanCap: Joint Panoptic Segmentation and Grounded Captions for
Fine-Grained Understanding and Generation Paper
• 2502.02589
• Published Feb 4, 2025 • 9
Generating Multi-Image Synthetic Data for Text-to-Image Customization Paper
• 2502.01720
• Published Feb 3, 2025 • 8
Expect the Unexpected: FailSafe Long Context QA for Finance Paper
• 2502.06329
• Published Feb 10, 2025 • 133
TextAtlas5M: A Large-scale Dataset for Dense Text Image Generation Paper
• 2502.07870
• Published Feb 11, 2025 • 45
MM-RLHF: The Next Step Forward in Multimodal LLM Alignment Paper
• 2502.10391
• Published Feb 14, 2025 • 34
REALTALK: A 21-Day Real-World Dataset for Long-Term Conversation Paper
• 2502.13270
• Published Feb 18, 2025 • 6
Audio-FLAN: A Preliminary Release Paper
• 2502.16584
• Published Feb 23, 2025 • 36
VideoUFO: A Million-Scale User-Focused Dataset for Text-to-Video
Generation Paper
• 2503.01739
• Published Mar 3, 2025 • 9
Qilin: A Multimodal Information Retrieval Dataset with APP-level User
Sessions Paper
• 2503.00501
• Published Mar 1, 2025 • 12
KodCode: A Diverse, Challenging, and Verifiable Synthetic Dataset for
Coding Paper
• 2503.02951
• Published Mar 4, 2025 • 33
Crowdsource, Crawl, or Generate? Creating SEA-VL, a Multicultural
Vision-Language Dataset for Southeast Asia Paper
• 2503.07920
• Published Mar 10, 2025 • 101
CineBrain: A Large-Scale Multi-Modal Brain Dataset During Naturalistic
Audiovisual Narrative Processing Paper
• 2503.06940
• Published Mar 10, 2025 • 11
DropletVideo: A Dataset and Approach to Explore Integral Spatio-Temporal
Consistent Video Generation Paper
• 2503.06053
• Published Mar 8, 2025 • 138
ELTEX: A Framework for Domain-Driven Synthetic Data Generation Paper
• 2503.15055
• Published Mar 19, 2025 • 6
TeleAntiFraud-28k: An Audio-Text Slow-Thinking Dataset for Telecom Fraud
Detection Paper
• 2503.24115
• Published Mar 31, 2025 • 11
LiveVQA: Live Visual Knowledge Seeking Paper
• 2504.05288
• Published Apr 7, 2025 • 15
COIG-P: A High-Quality and Large-Scale Chinese Preference Dataset for
Alignment with Human Values Paper
• 2504.05535
• Published Apr 7, 2025 • 44
DeepMath-103K: A Large-Scale, Challenging, Decontaminated, and
Verifiable Mathematical Dataset for Advancing Reasoning Paper
• 2504.11456
• Published Apr 15, 2025 • 12
SIFT-50M: A Large-Scale Multilingual Dataset for Speech Instruction
Fine-Tuning Paper
• 2504.09081
• Published Apr 12, 2025 • 16
CLIMB: CLustering-based Iterative Data Mixture Bootstrapping for
Language Model Pre-training Paper
• 2504.13161
• Published Apr 17, 2025 • 97
MIG: Automatic Data Selection for Instruction Tuning by Maximizing
Information Gain in Semantic Space Paper
• 2504.13835
• Published Apr 18, 2025 • 38
LeetCodeDataset: A Temporal Dataset for Robust Evaluation and Efficient
Training of Code LLMs Paper
• 2504.14655
• Published Apr 20, 2025 • 21
QuaDMix: Quality-Diversity Balanced Data Selection for Efficient LLM
Pretraining Paper
• 2504.16511
• Published Apr 23, 2025 • 22
Dynamic Camera Poses and Where to Find Them Paper
• 2504.17788
• Published Apr 24, 2025 • 6
R&B: Domain Regrouping and Data Mixture Balancing for Efficient
Foundation Model Training Paper
• 2505.00358
• Published May 1, 2025 • 26
PrismLayers: Open Data for High-Quality Multi-Layer Transparent Image
Generative Models Paper
• 2505.22523
• Published May 28, 2025 • 7
Large Language Models for Data Synthesis Paper
• 2505.14752
• Published May 20, 2025 • 49
SynthRL: Scaling Visual Reasoning with Verifiable Data Synthesis Paper
• 2506.02096
• Published Jun 2, 2025 • 52
One Missing Piece for Open-Source Reasoning Models: A Dataset to
Mitigate Cold-Starting Short CoT LLMs in RL Paper
• 2506.02338
• Published Jun 3, 2025 • 5
The Common Pile v0.1: An 8TB Dataset of Public Domain and Openly
Licensed Text Paper
• 2506.05209
• Published Jun 5, 2025 • 61
Peer-Ranked Precision: Creating a Foundational Dataset for Fine-Tuning
Vision Models from DataSeeds' Annotated Imagery Paper
• 2506.05673
• Published Jun 6, 2025 • 10
CCI4.0: A Bilingual Pretraining Dataset for Enhancing Reasoning in Large
Language Models Paper
• 2506.07463
• Published Jun 9, 2025 • 11
Sekai: A Video Dataset towards World Exploration Paper
• 2506.15675
• Published Jun 18, 2025 • 66
Phantom-Data: Towards a General Subject-Consistent Video Generation
Dataset Paper
• 2506.18851
• Published Jun 23, 2025 • 30
ShareGPT-4o-Image: Aligning Multimodal Models with GPT-4o-Level Image
Generation Paper
• 2506.18095
• Published Jun 22, 2025 • 66
BMMR: A Large-Scale Bilingual Multimodal Multi-Discipline Reasoning
Dataset Paper
• 2507.03483
• Published Jul 4, 2025 • 24
Easy Dataset: A Unified and Extensible Framework for Synthesizing LLM
Fine-Tuning Data from Unstructured Documents Paper
• 2507.04009
• Published Jul 5, 2025 • 54
Go to Zero: Towards Zero-shot Motion Generation with Million-scale Data Paper
• 2507.07095
• Published Jul 9, 2025 • 56
SpeakerVid-5M: A Large-Scale High-Quality Dataset for Audio-Visual
Dyadic Interactive Human Generation Paper
• 2507.09862
• Published Jul 14, 2025 • 51
A Data-Centric Framework for Addressing Phonetic and Prosodic Challenges
in Russian Speech Generative Models Paper
• 2507.13563
• Published Jul 17, 2025 • 53
MegaScience: Pushing the Frontiers of Post-Training Datasets for Science
Reasoning Paper
• 2507.16812
• Published Jul 22, 2025 • 64
Zebra-CoT: A Dataset for Interleaved Vision Language Reasoning Paper
• 2507.16746
• Published Jul 22, 2025 • 35
Multi-human Interactive Talking Dataset Paper
• 2508.03050
• Published Aug 5, 2025 • 10
VeriGUI: Verifiable Long-Chain GUI Dataset Paper
• 2508.04026
• Published Aug 6, 2025 • 164
FACTORY: A Challenging Human-Verified Prompt Set for Long-Form
Factuality Paper
• 2508.00109
• Published Jul 31, 2025 • 4
Speech-to-LaTeX: New Models and Datasets for Converting Spoken Equations
and Sentences Paper
• 2508.03542
• Published Aug 5, 2025 • 5
Echo-4o: Harnessing the Power of GPT-4o Synthetic Images for Improved
Image Generation Paper
• 2508.09987
• Published Aug 13, 2025 • 25
A Survey of Scientific Large Language Models: From Data Foundations to
Agent Frontiers Paper
• 2508.21148
• Published Aug 28, 2025 • 142
TalkVid: A Large-Scale Diversified Dataset for Audio-Driven Talking Head
Synthesis Paper
• 2508.13618
• Published Aug 19, 2025 • 18
Inverse IFEval: Can LLMs Unlearn Stubborn Training Conventions to Follow
Real Instructions? Paper
• 2509.04292
• Published Sep 4, 2025 • 58
Reverse-Engineered Reasoning for Open-Ended Generation Paper
• 2509.06160
• Published Sep 7, 2025 • 151
FLUX-Reason-6M & PRISM-Bench: A Million-Scale Text-to-Image Reasoning
Dataset and Comprehensive Benchmark Paper
• 2509.09680
• Published Sep 11, 2025 • 44
SpatialVID: A Large-Scale Video Dataset with Spatial Annotations Paper
• 2509.09676
• Published Sep 11, 2025 • 35
OmniWorld: A Multi-Domain and Multi-Modal Dataset for 4D World Modeling Paper
• 2509.12201
• Published Sep 15, 2025 • 107
PersonaX: Multimodal Datasets with LLM-Inferred Behavior Traits Paper
• 2509.11362
• Published Sep 14, 2025 • 5
MultiEdit: Advancing Instruction-based Image Editing on Diverse and
Challenging Tasks Paper
• 2509.14638
• Published Sep 18, 2025 • 14
Scaling Instruction-Based Video Editing with a High-Quality Synthetic
Dataset Paper
• 2510.15742
• Published Oct 17, 2025 • 51
Pico-Banana-400K: A Large-Scale Dataset for Text-Guided Image Editing Paper
• 2510.19808
• Published Oct 22, 2025 • 30
FineVision: Open Data Is All You Need Paper
• 2510.17269
• Published Oct 20, 2025 • 80
Fine-T2I: An Open, Large-Scale, and Diverse Dataset for High-Quality T2I Fine-Tuning Paper
• 2602.09439
• Published Feb 10 • 13