  • \n\n","updatedAt":"2025-03-06T02:31:01.640Z","author":{"_id":"653df1323479e9ebbe3eb6cc","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/653df1323479e9ebbe3eb6cc/K_g-r1iMRNKj99LXPuYF3.jpeg","fullname":"Zhangchen Xu","name":"zhangchenxu","type":"user","isPro":true,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":26,"isUserFollowing":false}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.8236393928527832},"editors":["zhangchenxu"],"editorAvatarUrls":["https://cdn-avatars.huggingface.co/v1/production/uploads/653df1323479e9ebbe3eb6cc/K_g-r1iMRNKj99LXPuYF3.jpeg"],"reactions":[],"isReport":false}},{"id":"67ca4cedeb6f53145c852411","author":{"_id":"63d3e0e8ff1384ce6c5dd17d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg","fullname":"Librarian Bot (Bot)","name":"librarian-bot","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":318,"isUserFollowing":false},"createdAt":"2025-03-07T01:33:33.000Z","type":"comment","data":{"edited":false,"hidden":false,"latest":{"raw":"This is an automated message from the [Librarian Bot](https://huggingface.co/librarian-bots). I found the following papers similar to this paper. \n\nThe following papers were recommended by the Semantic Scholar API \n\n* [Learning to Solve and Verify: A Self-Play Framework for Code and Test Generation](https://huggingface.co/papers/2502.14948) (2025)\n* [ACECODER: Acing Coder RL via Automated Test-Case Synthesis](https://huggingface.co/papers/2502.01718) (2025)\n* [IterPref: Focal Preference Learning for Code Generation via Iterative Debugging](https://huggingface.co/papers/2503.02783) (2025)\n* [RefineCoder: Iterative Improving of Large Language Models via Adaptive Critique Refinement for Code Generation](https://huggingface.co/papers/2502.09183) (2025)\n* [Building A Proof-Oriented Programmer That Is 64% Better Than GPT-4o Under Data Scarsity](https://huggingface.co/papers/2502.11901) (2025)\n* [Robust Learning of Diverse Code Edits](https://huggingface.co/papers/2503.03656) (2025)\n* [Scoring Verifiers: Evaluating Synthetic Verification in Code and Reasoning](https://huggingface.co/papers/2502.13820) (2025)\n\n\n Please give a thumbs up to this comment if you found it helpful!\n\n If you want recommendations for any Paper on Hugging Face checkout [this](https://huggingface.co/spaces/librarian-bots/recommend_similar_papers) Space\n\n You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: `@librarian-bot recommend`","html":"

    This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

    \n

    The following papers were recommended by the Semantic Scholar API

    \n\n

    Please give a thumbs up to this comment if you found it helpful!

    \n

    If you want recommendations for any Paper on Hugging Face checkout this Space

    \n

    You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: \n\n@librarian-bot\n\t recommend

    \n","updatedAt":"2025-03-07T01:33:33.723Z","author":{"_id":"63d3e0e8ff1384ce6c5dd17d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg","fullname":"Librarian Bot (Bot)","name":"librarian-bot","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":318,"isUserFollowing":false}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.7141962051391602},"editors":["librarian-bot"],"editorAvatarUrls":["https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg"],"reactions":[],"isReport":false}},{"id":"67ca926ff9e51bff8a5ed396","author":{"_id":"6588fb0415b65eb9baf70591","avatarUrl":"/avatars/13d0a5a3efa2802c71294ab484da453d.svg","fullname":"James Freeman","name":"jamesfreeman","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"isUserFollowing":false},"createdAt":"2025-03-07T06:30:07.000Z","type":"comment","data":{"edited":false,"hidden":false,"latest":{"raw":"Any clue as to why it does so poorly on the “hard” LiveCodeBench benchmark? Your models were almost dead last on that specific benchmark and the score you all received on the “medium” one represented a huge fall off from the “easy” questions where you all received SOA scores among all open source models. \n\nNot suggesting anything here by pointing that out, just genuinely curious to see if you all have a hypothesis that would explain your model’s notably poor performance on that one specific benchmark (relative to the other models whose scores you provided, namely Qwen & DeepSeek variations). ","html":"

    Any clue as to why it does so poorly on the “hard” LiveCodeBench benchmark? Your models were almost dead last on that specific benchmark and the score you all received on the “medium” one represented a huge fall off from the “easy” questions where you all received SOA scores among all open source models.

    \n

    Not suggesting anything here by pointing that out, just genuinely curious to see if you all have a hypothesis that would explain your model’s notably poor performance on that one specific benchmark (relative to the other models whose scores you provided, namely Qwen & DeepSeek variations).

    \n","updatedAt":"2025-03-07T06:30:07.122Z","author":{"_id":"6588fb0415b65eb9baf70591","avatarUrl":"/avatars/13d0a5a3efa2802c71294ab484da453d.svg","fullname":"James Freeman","name":"jamesfreeman","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"isUserFollowing":false}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.969617486000061},"editors":["jamesfreeman"],"editorAvatarUrls":["/avatars/13d0a5a3efa2802c71294ab484da453d.svg"],"reactions":[],"isReport":false},"replies":[{"id":"67ca94f5dc2df985dff77b0c","author":{"_id":"653df1323479e9ebbe3eb6cc","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/653df1323479e9ebbe3eb6cc/K_g-r1iMRNKj99LXPuYF3.jpeg","fullname":"Zhangchen Xu","name":"zhangchenxu","type":"user","isPro":true,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":26,"isUserFollowing":false},"createdAt":"2025-03-07T06:40:53.000Z","type":"comment","data":{"edited":true,"hidden":false,"latest":{"raw":"I have a hypothesis, though I'm not entirely certain if this is the root cause. When examining LiveCodeBench's questions, I noticed many utilize Online Judge style inputs and outputs (receiving data through stdin and producing results via stdout). In contrast, we developed KodCode using pytest as our testing framework, which evaluates functions through direct return values and assertions. This might explain the performance degragation.\n\n(If that is the case, I will probably create a new oj style subset)","html":"

    I have a hypothesis, though I'm not entirely certain if this is the root cause. When examining LiveCodeBench's questions, I noticed many utilize Online Judge style inputs and outputs (receiving data through stdin and producing results via stdout). In contrast, we developed KodCode using pytest as our testing framework, which evaluates functions through direct return values and assertions. This might explain the performance degragation.

    \n

    (If that is the case, I will probably create a new oj style subset)

    \n","updatedAt":"2025-03-07T08:05:01.540Z","author":{"_id":"653df1323479e9ebbe3eb6cc","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/653df1323479e9ebbe3eb6cc/K_g-r1iMRNKj99LXPuYF3.jpeg","fullname":"Zhangchen Xu","name":"zhangchenxu","type":"user","isPro":true,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":26,"isUserFollowing":false}},"numEdits":1,"identifiedLanguage":{"language":"en","probability":0.9078660011291504},"editors":["zhangchenxu"],"editorAvatarUrls":["https://cdn-avatars.huggingface.co/v1/production/uploads/653df1323479e9ebbe3eb6cc/K_g-r1iMRNKj99LXPuYF3.jpeg"],"reactions":[],"isReport":false,"parentCommentId":"67ca926ff9e51bff8a5ed396"}}]}],"primaryEmailConfirmed":false,"paper":{"id":"2503.02951","authors":[{"_id":"67c907ea7568a12737ad4535","user":{"_id":"653df1323479e9ebbe3eb6cc","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/653df1323479e9ebbe3eb6cc/K_g-r1iMRNKj99LXPuYF3.jpeg","isPro":true,"fullname":"Zhangchen Xu","user":"zhangchenxu","type":"user"},"name":"Zhangchen Xu","status":"claimed_verified","statusLastChangedAt":"2025-03-06T09:26:50.636Z","hidden":false},{"_id":"67c907ea7568a12737ad4536","user":{"_id":"637c88b6d55081513c5690d8","avatarUrl":"/avatars/6766e23ebf46b46d6c8b48351c571907.svg","isPro":false,"fullname":"Yang Liu","user":"nlpyang","type":"user"},"name":"Yang Liu","status":"extracted_pending","statusLastChangedAt":"2025-03-06T02:26:54.940Z","hidden":false},{"_id":"67c907ea7568a12737ad4537","user":{"_id":"605e8dfd5abeb13e714c4c18","avatarUrl":"/avatars/bc27a0ed17b2bd4311e89d3028fa327b.svg","isPro":true,"fullname":"yueqin yin","user":"yyqoni","type":"user"},"name":"Yueqin Yin","status":"claimed_verified","statusLastChangedAt":"2025-03-06T09:26:48.614Z","hidden":false},{"_id":"67c907ea7568a12737ad4538","user":{"_id":"653b2524b77b5e255f2d29d2","avatarUrl":"/avatars/f69aea8de84c435295e7638bad5bd82e.svg","isPro":true,"fullname":"Mingyuan Zhou","user":"mingyuanzhou","type":"user"},"name":"Mingyuan Zhou","status":"admin_assigned","statusLastChangedAt":"2025-03-06T10:03:56.474Z","hidden":false},{"_id":"67c907ea7568a12737ad4539","name":"Radha Poovendran","hidden":false}],"publishedAt":"2025-03-04T19:17:36.000Z","submittedOnDailyAt":"2025-03-06T00:01:01.626Z","title":"KodCode: A Diverse, Challenging, and Verifiable Synthetic Dataset for\n Coding","submittedOnDailyBy":{"_id":"653df1323479e9ebbe3eb6cc","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/653df1323479e9ebbe3eb6cc/K_g-r1iMRNKj99LXPuYF3.jpeg","isPro":true,"fullname":"Zhangchen Xu","user":"zhangchenxu","type":"user"},"summary":"We introduce KodCode, a synthetic dataset that addresses the persistent\nchallenge of acquiring high-quality, verifiable training data across diverse\ndifficulties and domains for training Large Language Models for coding.\nExisting code-focused resources typically fail to ensure either the breadth of\ncoverage (e.g., spanning simple coding tasks to advanced algorithmic problems)\nor verifiable correctness (e.g., unit tests). In contrast, KodCode comprises\nquestion-solution-test triplets that are systematically validated via a\nself-verification procedure. Our pipeline begins by synthesizing a broad range\nof coding questions, then generates solutions and test cases with additional\nattempts allocated to challenging problems. 
Finally, post-training data\nsynthesis is done by rewriting questions into diverse formats and generating\nresponses under a test-based reject sampling procedure from a reasoning model\n(DeepSeek R1). This pipeline yields a large-scale, robust and diverse coding\ndataset. KodCode is suitable for supervised fine-tuning and the paired unit\ntests also provide great potential for RL tuning. Fine-tuning experiments on\ncoding benchmarks (HumanEval(+), MBPP(+), BigCodeBench, and LiveCodeBench)\ndemonstrate that KodCode-tuned models achieve state-of-the-art performance,\nsurpassing models like Qwen2.5-Coder-32B-Instruct and\nDeepSeek-R1-Distill-Llama-70B.","upvotes":33,"discussionId":"67c907ee7568a12737ad4633","projectPage":"https://kodcode-ai.github.io/","githubRepo":"https://github.com/KodCode-AI/kodcode","githubRepoAddedBy":"user","ai_summary":"KodCode is a synthetic coding dataset that ensures broad coverage and correctness through systematic validation, enabling state-of-the-art performance in coding benchmarks.","ai_keywords":["Larger Language Models","coding","question-solution-test triplets","self-verification procedure","reasoning model","DeepSeek R1","supervised fine-tuning","RL tuning","HumanEval","MBPP","BigCodeBench","LiveCodeBench","Qwen2.5-Coder-32B-Instruct","DeepSeek-R1-Distill-Llama-70B"],"githubStars":311},"canReadDatabase":false,"canManagePapers":false,"canSubmit":false,"hasHfLevelAccess":false,"upvoted":false,"upvoters":[{"_id":"653df1323479e9ebbe3eb6cc","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/653df1323479e9ebbe3eb6cc/K_g-r1iMRNKj99LXPuYF3.jpeg","isPro":true,"fullname":"Zhangchen Xu","user":"zhangchenxu","type":"user"},{"_id":"605e8dfd5abeb13e714c4c18","avatarUrl":"/avatars/bc27a0ed17b2bd4311e89d3028fa327b.svg","isPro":true,"fullname":"yueqin yin","user":"yyqoni","type":"user"},{"_id":"65dafc22ad7ccf910d7144da","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/65dafc22ad7ccf910d7144da/bsGJXsVjwJTVoqSO0b1O3.jpeg","isPro":false,"fullname":"Yuetai Li","user":"TaiGary","type":"user"},{"_id":"66277df33d64429f4595f9be","avatarUrl":"/avatars/b6ec05883441c8316452945b1f3fefb3.svg","isPro":false,"fullname":"yang yu","user":"yangyangcici","type":"user"},{"_id":"6270324ebecab9e2dcf245de","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/6270324ebecab9e2dcf245de/cMbtWSasyNlYc9hvsEEzt.jpeg","isPro":false,"fullname":"Kye Gomez","user":"kye","type":"user"},{"_id":"648eb1eb59c4e5c87dc116e0","avatarUrl":"/avatars/c636cea39c2c0937f01398c94ead5dad.svg","isPro":false,"fullname":"fdsqefsgergd","user":"T-representer","type":"user"},{"_id":"63b2a92e18e5cf2cdd333492","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/63b2a92e18e5cf2cdd333492/GxnngJG0u7d0jYTEFOrfe.png","isPro":false,"fullname":"Jaehyun Jun","user":"btjhjeon","type":"user"},{"_id":"650c8bfb3d3542884da1a845","avatarUrl":"/avatars/863a5deebf2ac6d4faedc4dd368e0561.svg","isPro":false,"fullname":"Adhurim ","user":"Limi07","type":"user"},{"_id":"6303eda97b50dd9d0a36c731","avatarUrl":"/avatars/2f2188e46286c71606feb3b2b77a91f4.svg","isPro":false,"fullname":"Renat","user":"u-brixton","type":"user"},{"_id":"665b133508d536a8ac804f7d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/noauth/Uwi0OnANdTbRbHHQvGqvR.png","isPro":false,"fullname":"Paulson","user":"Pnaomi","type":"user"},{"_id":"651c80a26ba9ab9b9582c273","avatarUrl":"/avatars/e963452eafd21f517d800f2e58e0f918.svg","isPro":false,"fullname":"siyeng 
feng","user":"siyengfeng","type":"user"},{"_id":"65962a79f67e8fb2a57cd338","avatarUrl":"/avatars/e4c4652f7aaa3ab1585b8f7a4f972311.svg","isPro":false,"fullname":"HNO3","user":"HNO333333","type":"user"}],"acceptLanguages":["*"],"dailyPaperRank":0}">
    arxiv:2503.02951

    KodCode: A Diverse, Challenging, and Verifiable Synthetic Dataset for Coding

Published on Mar 4, 2025 · Submitted by Zhangchen Xu on Mar 6, 2025

Authors: Zhangchen Xu, Yang Liu, Yueqin Yin, Mingyuan Zhou, Radha Poovendran

Abstract

AI-generated summary: KodCode is a synthetic coding dataset that ensures broad coverage and correctness through systematic validation, enabling state-of-the-art performance in coding benchmarks.

We introduce KodCode, a synthetic dataset that addresses the persistent challenge of acquiring high-quality, verifiable training data across diverse difficulties and domains for training Large Language Models for coding. Existing code-focused resources typically fail to ensure either the breadth of coverage (e.g., spanning simple coding tasks to advanced algorithmic problems) or verifiable correctness (e.g., unit tests). In contrast, KodCode comprises question-solution-test triplets that are systematically validated via a self-verification procedure. Our pipeline begins by synthesizing a broad range of coding questions, then generates solutions and test cases with additional attempts allocated to challenging problems. Finally, post-training data synthesis is done by rewriting questions into diverse formats and generating responses under a test-based reject sampling procedure from a reasoning model (DeepSeek R1). This pipeline yields a large-scale, robust and diverse coding dataset. KodCode is suitable for supervised fine-tuning and the paired unit tests also provide great potential for RL tuning. Fine-tuning experiments on coding benchmarks (HumanEval(+), MBPP(+), BigCodeBench, and LiveCodeBench) demonstrate that KodCode-tuned models achieve state-of-the-art performance, surpassing models like Qwen2.5-Coder-32B-Instruct and DeepSeek-R1-Distill-Llama-70B.
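The test-based reject sampling step described in the abstract can be pictured with a minimal sketch. This is an illustration under assumptions, not the authors' released pipeline: `generate_response` stands in for a call to the reasoning model, and the pytest invocation details are a guess at a reasonable setup.

```python
import re
import subprocess
import tempfile
from pathlib import Path

def extract_code(response: str) -> str:
    """Pull the first python-fenced code block out of a model response (assumed format)."""
    match = re.search(r"```python\n(.*?)```", response, re.DOTALL)
    return match.group(1) if match else response

def passes_tests(solution: str, tests: str) -> bool:
    """Run the generated pytest unit tests against the candidate solution in a scratch dir."""
    with tempfile.TemporaryDirectory() as tmp:
        Path(tmp, "solution.py").write_text(solution)
        Path(tmp, "test_solution.py").write_text("from solution import *\n" + tests)
        try:
            result = subprocess.run(
                ["pytest", "-q", "test_solution.py"],
                cwd=tmp, capture_output=True, timeout=60,
            )
        except subprocess.TimeoutExpired:
            return False
        return result.returncode == 0

def reject_sample(question: str, tests: str, generate_response, max_attempts: int = 4):
    """Keep the first response whose extracted code passes the paired tests; otherwise discard."""
    for _ in range(max_attempts):
        response = generate_response(question)          # hypothetical call to a reasoning model
        if passes_tests(extract_code(response), tests):
            return response                             # accepted: verified by the unit tests
    return None                                         # rejected: no attempt passed
```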

    Community

    Paper author Paper submitter

KodCode is the largest fully-synthetic open-source dataset providing verifiable solutions and tests for coding tasks. It contains 12 distinct subsets spanning various domains (from algorithmic to package-specific knowledge) and difficulty levels (from basic coding exercises to interview and competitive programming challenges). KodCode is designed for both supervised fine-tuning (SFT) and RL tuning.

    Project Website
    • 📄 Technical Report - Discover the methodology and technical details behind KodCode
    • 💾 Github Repo - Access the complete pipeline used to produce KodCode V1
    • 🤗 HF Datasets: KodCode-V1 (for RL); KodCode-V1-SFT-R1 (for SFT)
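For readers who want to inspect the data, a minimal loading sketch follows. The Hub repo id `KodCode/KodCode-V1` is inferred from the links above, and the column access is illustrative only; check the dataset card for the actual schema.

```python
from datasets import load_dataset

# Repo id inferred from the links above; adjust if the org/name on the Hub differs.
ds = load_dataset("KodCode/KodCode-V1", split="train")

# Inspect one record; the column names are whatever the dataset actually ships with.
example = ds[0]
print(list(example.keys()))
print(example)
```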

    This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API:

    • Learning to Solve and Verify: A Self-Play Framework for Code and Test Generation (2025) - https://huggingface.co/papers/2502.14948
    • ACECODER: Acing Coder RL via Automated Test-Case Synthesis (2025) - https://huggingface.co/papers/2502.01718
    • IterPref: Focal Preference Learning for Code Generation via Iterative Debugging (2025) - https://huggingface.co/papers/2503.02783
    • RefineCoder: Iterative Improving of Large Language Models via Adaptive Critique Refinement for Code Generation (2025) - https://huggingface.co/papers/2502.09183
    • Building A Proof-Oriented Programmer That Is 64% Better Than GPT-4o Under Data Scarsity (2025) - https://huggingface.co/papers/2502.11901
    • Robust Learning of Diverse Code Edits (2025) - https://huggingface.co/papers/2503.03656
    • Scoring Verifiers: Evaluating Synthetic Verification in Code and Reasoning (2025) - https://huggingface.co/papers/2502.13820

    Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any paper on Hugging Face, check out this Space

    You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend

Any clue as to why it does so poorly on the “hard” LiveCodeBench benchmark? Your models were almost dead last on that specific benchmark, and the score you received on the “medium” one represented a huge fall-off from the “easy” questions, where you received SOTA scores among all open-source models.

    Not suggesting anything here by pointing that out, just genuinely curious to see if you all have a hypothesis that would explain your model’s notably poor performance on that one specific benchmark (relative to the other models whose scores you provided, namely Qwen & DeepSeek variations).

Paper author

I have a hypothesis, though I'm not entirely certain if this is the root cause. When examining LiveCodeBench's questions, I noticed many use Online Judge-style inputs and outputs (receiving data through stdin and producing results via stdout). In contrast, we developed KodCode using pytest as our testing framework, which evaluates functions through direct return values and assertions. This might explain the performance degradation.

(If that is the case, I will probably create a new OJ-style subset.)
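For concreteness, the two evaluation styles contrasted in this reply look roughly like the toy example below (not drawn from either dataset):

```python
# pytest / assertion style (how KodCode's paired tests verify solutions):
def add(a: int, b: int) -> int:
    return a + b

def test_add():
    assert add(2, 3) == 5
    assert add(-1, 1) == 0

# Online-Judge style (common in LiveCodeBench problems): read stdin, write stdout.
import sys

def main() -> None:
    a, b = map(int, sys.stdin.read().split())
    print(a + b)

if __name__ == "__main__":
    main()
```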
