https://da-code-bench.github.io/

This EMNLP 2024 paper introduces DA-Code, a code generation benchmark specifically designed to assess LLMs on agent-based data science tasks. This benchmark features three core elements: First, the tasks within DA-Code are inherently challenging, setting them apart from traditional code generation tasks and demanding advanced coding skills in grounding and planning. Second, examples in DA-Code are all based on real and diverse data, covering a wide range of complex data wrangling and analytics tasks. Third, to solve the tasks, the models must utilize complex data science programming languages to perform intricate data processing and derive the answers. We set up the benchmark in a controllable and executable environment that aligns with real-world data analysis scenarios and is scalable. The annotators meticulously design the evaluation suite to ensure the accuracy and robustness of the evaluation. We develop the DA-Agent baseline. Experiments show that although the baseline performs better than other existing frameworks, using the current best LLMs achieves only 30.5% accuracy, leaving ample room for improvement.

This is an automated message from the Librarian Bot (https://huggingface.co/librarian-bots). I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API:

* SUPER: Evaluating Agents on Setting Up and Executing Tasks from Research Repositories (2024), https://huggingface.co/papers/2409.07440
* HyperAgent: Generalist Software Engineering Agents to Solve Coding Tasks at Scale (2024), https://huggingface.co/papers/2409.16299
* DSBench: How Far Are Data Science Agents to Becoming Data Science Experts? (2024), https://huggingface.co/papers/2409.07703
* Codev-Bench: How Do LLMs Understand Developer-Centric Code Completion? (2024), https://huggingface.co/papers/2410.01353
* Steering Large Language Models between Code Execution and Textual Reasoning (2024), https://huggingface.co/papers/2410.03524

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any paper on Hugging Face, check out this Space: https://huggingface.co/spaces/librarian-bots/recommend_similar_papers

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend

This paper is interesting. I have a model that you can test on this benchmark: EpistemeAI/Fireball-Meta-Llama-3.1-8B-Instruct-Agent-0.003-128K-code-ds


arxiv:2410.07331

DA-Code: Agent Data Science Code Generation Benchmark for Large Language Models

Published on Oct 9, 2024 · Submitted by Xiao Liu on Oct 14, 2024

Authors: Yiming Huang, Jianwen Luo, Yan Yu, Yitong Zhang, Fangyu Lei, Yifan Wei, Shizhu He, Lifu Huang, Xiao Liu, Jun Zhao, Kang Liu

Abstract

AI-generated summary: A code generation benchmark, DA-Code, evaluates LLMs on agent-based data science tasks using real-world, diverse datasets and complex data science programming languages.

We introduce DA-Code, a code generation benchmark specifically designed to assess LLMs on agent-based data science tasks. This benchmark features three core elements: First, the tasks within DA-Code are inherently challenging, setting them apart from traditional code generation tasks and demanding advanced coding skills in grounding and planning. Second, examples in DA-Code are all based on real and diverse data, covering a wide range of complex data wrangling and analytics tasks. Third, to solve the tasks, the models must utilize complex data science programming languages to perform intricate data processing and derive the answers. We set up the benchmark in a controllable and executable environment that aligns with real-world data analysis scenarios and is scalable. The annotators meticulously design the evaluation suite to ensure the accuracy and robustness of the evaluation. We develop the DA-Agent baseline. Experiments show that although the baseline performs better than other existing frameworks, using the current best LLMs achieves only 30.5% accuracy, leaving ample room for improvement. We release our benchmark at https://da-code-bench.github.io.