Data - a Testerpce Collection
Skywork-SWE: Unveiling Data Scaling Laws for Software Engineering in LLMs Paper
• 2506.19290
• Published Jun 24, 2025 • 53
Data Efficacy for Language Model Training Paper
• 2506.21545
• Published Jun 26, 2025 • 11
Easy Dataset: A Unified and Extensible Framework for Synthesizing LLM Fine-Tuning Data from Unstructured Documents Paper
• 2507.04009
• Published Jul 5, 2025 • 54
RefineX: Learning to Refine Pre-training Data at Scale from Expert-Guided Programs Paper
• 2507.03253
• Published Jul 4, 2025 • 19
Scaling Laws for Optimal Data Mixtures Paper
• 2507.09404
• Published Jul 12, 2025 • 38
Zebra-CoT: A Dataset for Interleaved Vision Language Reasoning Paper
• 2507.16746
• Published Jul 22, 2025 • 35
BeyondWeb: Lessons from Scaling Synthetic Data for Trillion-scale Pretraining Paper
• 2508.10975
• Published Aug 14, 2025 • 60
TiKMiX: Take Data Influence into Dynamic Mixture for Language Model Pre-training Paper
• 2508.17677
• Published Aug 25, 2025 • 14
PersonaX: Multimodal Datasets with LLM-Inferred Behavior Traits Paper
• 2509.11362
• Published Sep 14, 2025 • 5
OpenGPT-4o-Image: A Comprehensive Dataset for Advanced Image Generation and Editing Paper
• 2509.24900
• Published Sep 29, 2025 • 53
DataFlow: An LLM-Driven Framework for Unified Data Preparation and Workflow Automation in the Era of Data-Centric AI Paper
• 2512.16676
• Published Dec 18, 2025 • 222
Can LLMs Clean Up Your Mess? A Survey of Application-Ready Data Preparation with LLMs Paper
• 2601.17058
• Published Jan 22 • 190
OPUS: Towards Efficient and Principled Data Selection in Large Language Model Pre-training in Every Iteration Paper
• 2602.05400
• Published Feb 5 • 352
DeepVision-103K: A Visually Diverse, Broad-Coverage, and Verifiable Mathematical Dataset for Multimodal Reasoning Paper
• 2602.16742
• Published Feb 18 • 12
DataFlex: A Unified Framework for Data-Centric Dynamic Training of Large Language Models Paper
• 2603.26164
• Published 24 days ago • 354
Tracing the Roots: A Multi-Agent Framework for Uncovering Data Lineage in Post-Training LLMs Paper
• 2604.10480
• Published 8 days ago • 20