Paper page - DataFlow: An LLM-Driven Framework for Unified Data Preparation and Workflow Automation in the Era of Data-Centric AI
https://github.com/OpenDCAI/DataFlow
Does this project have any restrictions on which models can be used? For example, is the use of Gemini or Claude limited? Can I use my own locally deployed models?
The DataFlow serving module supports both open-source models and locally deployed models.
arXiv lens breakdown of this paper 👉 https://arxivlens.com/PaperView/Details/dataflow-an-llm-driven-framework-for-unified-data-preparation-and-workflow-automation-in-the-era-of-data-centric-ai-3906-5f097fd0
- Key Findings
- Executive Summary
- Detailed Breakdown
- Practical Applications
Authors: Hao Liang, Xiaochen Ma, Zhou Liu, Zhen Hao Wong, Zhengyang Zhao, Zimo Meng, Runming He, Chengyu Shen, Qifeng Cai, Zhaoyang Han, Meiyi Qiang, Yalin Feng, Tianyi Bai, Zewei Pan, Ziyi Guo, Yizhen Jiang, Jingwen Deng, Qijie You, Peichao Lai, Tianyu Guo, Chi Hsu Tsai, Hengyi Feng, Rui Hu, Wenkai Yu, Junbo Niu, Bohan Zeng, Ruichuan An, Lu Ma, Jihao Huang, Yaowei Zheng, Conghui He, Linpeng Tang, Bin Cui, Weinan E, Wentao Zhang
Organization: Peking University
Published: 2025-12-18
Project page: https://github.com/OpenDCAI/DataFlow
DataFlow is an LLM-driven data preparation framework that enhances data quality and reproducibility for various tasks, improving LLM performance with automatically generated pipelines.
AI-generated summary
The rapidly growing demand for high-quality data in Large Language Models (LLMs) has intensified the need for scalable, reliable, and semantically rich data preparation pipelines. However, current practices remain dominated by ad-hoc scripts and loosely specified workflows, which lack principled abstractions, hinder reproducibility, and offer limited support for model-in-the-loop data generation. To address these challenges, we present DataFlow, a unified and extensible LLM-driven data preparation framework. DataFlow is designed with system-level abstractions that enable modular, reusable, and composable data transformations, and provides a PyTorch-style pipeline construction API for building debuggable and optimizable dataflows. The framework consists of nearly 200 reusable operators and six domain-general pipelines spanning text, mathematical reasoning, code, Text-to-SQL, agentic RAG, and large-scale knowledge extraction. To further improve usability, we introduce DataFlow-Agent, which automatically translates natural-language specifications into executable pipelines via operator synthesis, pipeline planning, and iterative verification. Across six representative use cases, DataFlow consistently improves downstream LLM performance. Our math, code, and text pipelines outperform curated human datasets and specialized synthetic baselines, achieving up to +3% execution accuracy in Text-to-SQL over SynSQL, +7% average improvements on code benchmarks, and 1–3 point gains on MATH, GSM8K, and AIME. Moreover, a unified 10K-sample dataset produced by DataFlow enables base models to surpass counterparts trained on 1M Infinity-Instruct data. These results demonstrate that DataFlow provides a practical and high-performance substrate for reliable, reproducible, and scalable LLM data preparation, and establishes a system-level foundation for future data-centric AI development.
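To make the phrase "PyTorch-style pipeline construction API" concrete, here is a minimal sketch of what composing reusable operators in that style could look like. It is not DataFlow's actual API: the names `Operator`, `MapField`, `MinLengthFilter`, `Pipeline`, and `CleanTextPipeline` are hypothetical and chosen only for illustration. The assumed pattern is that operators are declared in `__init__` and chained in `forward`, mirroring how `torch.nn.Module` is typically used.

```python
from typing import Callable, Dict, List

Record = Dict[str, object]


class Operator:
    """Hypothetical base class: a reusable transformation over a list of records."""

    def run(self, records: List[Record]) -> List[Record]:
        raise NotImplementedError


class MapField(Operator):
    """Apply a function to one field and write the result to another field."""

    def __init__(self, src: str, dst: str, fn: Callable[[object], object]) -> None:
        self.src = src
        self.dst = dst
        self.fn = fn

    def run(self, records: List[Record]) -> List[Record]:
        return [{**r, self.dst: self.fn(r[self.src])} for r in records]


class MinLengthFilter(Operator):
    """Keep only records whose field has at least `min_chars` characters."""

    def __init__(self, key: str, min_chars: int) -> None:
        self.key = key
        self.min_chars = min_chars

    def run(self, records: List[Record]) -> List[Record]:
        return [r for r in records if len(str(r.get(self.key, ""))) >= self.min_chars]


class Pipeline:
    """Hypothetical base class: subclasses declare operators and define forward()."""

    def forward(self, records: List[Record]) -> List[Record]:
        raise NotImplementedError

    def __call__(self, records: List[Record]) -> List[Record]:
        return self.forward(records)


class CleanTextPipeline(Pipeline):
    def __init__(self) -> None:
        # Operators are declared once, like layers in a torch.nn.Module.
        self.normalize = MapField("text", "text", lambda s: " ".join(str(s).split()))
        self.keep_long = MinLengthFilter("text", min_chars=10)

    def forward(self, records: List[Record]) -> List[Record]:
        # The data flow is explicit, so each step can be inspected or swapped.
        records = self.normalize.run(records)
        return self.keep_long.run(records)


def demo() -> List[Record]:
    data = [{"text": "  too short "}, {"text": "a   sentence that is long enough"}]
    return CleanTextPipeline()(data)
```

Calling `demo()` normalizes whitespace in both records and then drops the first one, because its cleaned text is shorter than ten characters. Keeping operators as separate objects is what would make them reusable across pipelines in a design like this.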