BigCodeArena: Unveiling More Reliable Human Preferences in Code Generation via Execution
\n","updatedAt":"2025-10-13T19:15:59.857Z","author":{"_id":"62b7fb545233925f253531c8","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/62b7fb545233925f253531c8/W50u2G1HK3EtUKHRU189V.jpeg","fullname":"Terry Yue Zhuo","name":"terryyz","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":32,"isUserFollowing":false}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.37965813279151917},"editors":["terryyz"],"editorAvatarUrls":["https://cdn-avatars.huggingface.co/v1/production/uploads/62b7fb545233925f253531c8/W50u2G1HK3EtUKHRU189V.jpeg"],"reactions":[{"reaction":"🔥","users":["taesiri","hngl","ZennyKenny","mihai-chindris"],"count":4},{"reaction":"🤗","users":["taesiri","mihai-chindris"],"count":2}],"isReport":false}},{"id":"68eda90a7e38a10ce93ec21f","author":{"_id":"63d3e0e8ff1384ce6c5dd17d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg","fullname":"Librarian Bot (Bot)","name":"librarian-bot","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":318,"isUserFollowing":false},"createdAt":"2025-10-14T01:36:10.000Z","type":"comment","data":{"edited":false,"hidden":false,"latest":{"raw":"This is an automated message from the [Librarian Bot](https://huggingface.co/librarian-bots). I found the following papers similar to this paper. \n\nThe following papers were recommended by the Semantic Scholar API \n\n* [RECODE-H: A Benchmark for Research Code Development with Interactive Human Feedback](https://huggingface.co/papers/2510.06186) (2025)\n* [SR-Eval: Evaluating LLMs on Code Generation under Stepwise Requirement Refinement](https://huggingface.co/papers/2509.18808) (2025)\n* [RefactorCoderQA: Benchmarking LLMs for Multi-Domain Coding Question Solutions in Cloud and Edge Deployment](https://huggingface.co/papers/2509.10436) (2025)\n* [Another Turn, Better Output? A Turn-Wise Analysis of Iterative LLM Prompting](https://huggingface.co/papers/2509.06770) (2025)\n* [Beyond Autoregression: An Empirical Study of Diffusion Large Language Models for Code Generation](https://huggingface.co/papers/2509.11252) (2025)\n* [SimulatorArena: Are User Simulators Reliable Proxies for Multi-Turn Evaluation of AI Assistants?](https://huggingface.co/papers/2510.05444) (2025)\n* [Vibe Checker: Aligning Code Evaluation with Human Preference](https://huggingface.co/papers/2510.07315) (2025)\n\n\n Please give a thumbs up to this comment if you found it helpful!\n\n If you want recommendations for any Paper on Hugging Face checkout [this](https://huggingface.co/spaces/librarian-bots/recommend_similar_papers) Space\n\n You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: `@librarian-bot recommend`","html":"
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
\n
The following papers were recommended by the Semantic Scholar API
Please give a thumbs up to this comment if you found it helpful!
\n
If you want recommendations for any Paper on Hugging Face checkout this Space
\n
You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: \n\n@librarian-bot\n\t recommend
\n","updatedAt":"2025-10-14T01:36:10.867Z","author":{"_id":"63d3e0e8ff1384ce6c5dd17d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg","fullname":"Librarian Bot (Bot)","name":"librarian-bot","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":318,"isUserFollowing":false}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.6830528378486633},"editors":["librarian-bot"],"editorAvatarUrls":["https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg"],"reactions":[],"isReport":false}}],"primaryEmailConfirmed":false,"paper":{"id":"2510.08697","authors":[{"_id":"68ec5e33cd07fb414898c90f","user":{"_id":"62b7fb545233925f253531c8","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/62b7fb545233925f253531c8/W50u2G1HK3EtUKHRU189V.jpeg","isPro":false,"fullname":"Terry Yue Zhuo","user":"terryyz","type":"user"},"name":"Terry Yue Zhuo","status":"claimed_verified","statusLastChangedAt":"2025-10-14T07:32:32.606Z","hidden":false},{"_id":"68ec5e33cd07fb414898c910","name":"Xiaolong Jin","hidden":false},{"_id":"68ec5e33cd07fb414898c911","user":{"_id":"68ed502c3af6e7780c27e225","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/no-auth/5sOHq7dCHekpKQcPN8IUR.png","isPro":false,"fullname":"Hange Liu","user":"hngl","type":"user"},"name":"Hange Liu","status":"claimed_verified","statusLastChangedAt":"2025-10-14T07:32:28.359Z","hidden":false},{"_id":"68ec5e33cd07fb414898c912","name":"Juyong Jiang","hidden":false},{"_id":"68ec5e33cd07fb414898c913","name":"Tianyang Liu","hidden":false},{"_id":"68ec5e33cd07fb414898c914","name":"Chen Gong","hidden":false},{"_id":"68ec5e33cd07fb414898c915","name":"Bhupesh Bishnoi","hidden":false},{"_id":"68ec5e33cd07fb414898c916","name":"Vaisakhi Mishra","hidden":false},{"_id":"68ec5e33cd07fb414898c917","name":"Marek Suppa","hidden":false},{"_id":"68ec5e33cd07fb414898c918","name":"Noah Ziems","hidden":false},{"_id":"68ec5e33cd07fb414898c919","name":"Saiteja Utpala","hidden":false},{"_id":"68ec5e33cd07fb414898c91a","name":"Ming Xu","hidden":false},{"_id":"68ec5e33cd07fb414898c91b","name":"Guangyu Song","hidden":false},{"_id":"68ec5e33cd07fb414898c91c","user":{"_id":"6346be8f7fb9f11870c63998","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/6346be8f7fb9f11870c63998/tFWawSkXL6bv1zgvzFWQd.png","isPro":false,"fullname":"Kaixin Li","user":"likaixin","type":"user"},"name":"Kaixin Li","status":"claimed_verified","statusLastChangedAt":"2025-10-14T07:32:21.682Z","hidden":false},{"_id":"68ec5e33cd07fb414898c91d","name":"Yuhan Cao","hidden":false},{"_id":"68ec5e33cd07fb414898c91e","user":{"_id":"635e3a76106f984574c36409","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1667120725800-635e3a76106f984574c36409.png","isPro":false,"fullname":"Bo Liu","user":"Benjamin-eecs","type":"user"},"name":"Bo Liu","status":"claimed_verified","statusLastChangedAt":"2025-10-13T10:06:03.836Z","hidden":false},{"_id":"68ec5e33cd07fb414898c91f","name":"Zheng Liu","hidden":false},{"_id":"68ec5e33cd07fb414898c920","name":"Sabina Abdurakhmanova","hidden":false},{"_id":"68ec5e33cd07fb414898c921","name":"Wenhao Yu","hidden":false},{"_id":"68ec5e33cd07fb414898c922","name":"Mengzhao Jia","hidden":false},{"_id":"68ec5e33cd07fb414898c923","name":"Jihan 
Yao","hidden":false},{"_id":"68ec5e33cd07fb414898c924","user":{"_id":"656e3808d4de03a07d116850","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/656e3808d4de03a07d116850/62cFw46AmuhdI3gS24F1M.jpeg","isPro":true,"fullname":"Kenneth Hamilton","user":"ZennyKenny","type":"user"},"name":"Kenneth Hamilton","status":"claimed_verified","statusLastChangedAt":"2025-10-14T07:32:30.438Z","hidden":false},{"_id":"68ec5e33cd07fb414898c925","name":"Kumar Shridhar","hidden":false},{"_id":"68ec5e33cd07fb414898c926","user":{"_id":"60535c9d10aba34e3b6a2ef7","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1616075850230-noauth.jpeg","isPro":false,"fullname":"vumichien","user":"vumichien","type":"user"},"name":"Minh Chien Vu","status":"claimed_verified","statusLastChangedAt":"2025-10-14T07:32:24.000Z","hidden":false},{"_id":"68ec5e33cd07fb414898c927","name":"Dingmin Wang","hidden":false},{"_id":"68ec5e33cd07fb414898c928","name":"Jiawei Liu","hidden":false},{"_id":"68ec5e33cd07fb414898c929","name":"Zijian Wang","hidden":false},{"_id":"68ec5e33cd07fb414898c92a","name":"Qian Liu","hidden":false},{"_id":"68ec5e33cd07fb414898c92b","name":"Binyuan Hui","hidden":false},{"_id":"68ec5e33cd07fb414898c92c","name":"Meg Risdal","hidden":false},{"_id":"68ec5e33cd07fb414898c92d","user":{"_id":"60f1abe7544c2adfd699860c","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1674929746905-60f1abe7544c2adfd699860c.jpeg","isPro":false,"fullname":"AK","user":"akhaliq","type":"user"},"name":"Ahsen Khaliq","status":"admin_assigned","statusLastChangedAt":"2025-10-13T12:48:46.417Z","hidden":false},{"_id":"68ec5e33cd07fb414898c92e","name":"Atin Sood","hidden":false},{"_id":"68ec5e33cd07fb414898c92f","name":"Zhenchang Xing","hidden":false},{"_id":"68ec5e33cd07fb414898c930","name":"Wasi Uddin Ahmad","hidden":false},{"_id":"68ec5e33cd07fb414898c931","name":"John Grundy","hidden":false},{"_id":"68ec5e33cd07fb414898c932","name":"David Lo","hidden":false},{"_id":"68ec5e33cd07fb414898c933","name":"Banghua Zhu","hidden":false},{"_id":"68ec5e33cd07fb414898c934","name":"Xiaoning Du","hidden":false},{"_id":"68ec5e33cd07fb414898c935","user":{"_id":"60ecaa5efee13fee7ada7af4","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1637694419284-60ecaa5efee13fee7ada7af4.jpeg","isPro":false,"fullname":"Torsten Scholak","user":"tscholak","type":"user"},"name":"Torsten Scholak","status":"claimed_verified","statusLastChangedAt":"2025-10-17T04:14:25.480Z","hidden":false},{"_id":"68ec5e33cd07fb414898c936","user":{"_id":"5e48005437cb5b49818287a5","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/5e48005437cb5b49818287a5/4uCXGGui-9QifAT4qelxU.png","isPro":false,"fullname":"Leandro von Werra","user":"lvwerra","type":"user"},"name":"Leandro von Werra","status":"claimed_verified","statusLastChangedAt":"2025-10-14T07:32:26.258Z","hidden":false}],"publishedAt":"2025-10-09T18:01:47.000Z","submittedOnDailyAt":"2025-10-13T00:34:36.848Z","title":"BigCodeArena: Unveiling More Reliable Human Preferences in Code\n Generation via Execution","submittedOnDailyBy":{"_id":"6039478ab3ecf716b1a5fd4d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/6039478ab3ecf716b1a5fd4d/_Thy4E7taiSYBLKxEKJbT.jpeg","isPro":true,"fullname":"taesiri","user":"taesiri","type":"user"},"summary":"Crowdsourced model evaluation platforms, such as Chatbot Arena, enable\nreal-time evaluation from human perspectives to assess the quality of model\nresponses. 
In the coding domain, manually examining the quality of\nLLM-generated content is extremely challenging, as it requires understanding\nlong chunks of raw code and deliberately simulating code execution. To this\nend, we introduce BigCodeArena, an open human evaluation platform for code\ngeneration backed by a comprehensive and on-the-fly execution environment.\nBuilt on top of Chatbot Arena, BigCodeArena enables the execution of\nLLM-generated code and allows humans to interact with the execution process and\noutcomes. We collected over 14,000 raw code-centric conversation sessions\nacross 10 widely used LLMs, spanning 10 languages and 8 types of execution\nenvironments. Among these conversations, we identified more than 4,700\nmulti-turn samples with pairwise human preferences. Further analysis uncovers\nunderexplored preferences of LLMs in fine-grained domains characterized by\ntasks, languages, and frameworks. To systematically examine code understanding\nand generation capabilities of frontier LLMs, we curated two benchmarks based\non the collected data, namely BigCodeReward and AutoCodeArena. For\nBigCodeReward, we post-processed the 4,700 conversations and evaluated the\nconsistency between reward models and human preferences. The evaluation shows\nthat most LLMs have superior performance in judging coding preferences when the\nexecution results are available. Inspired by these findings, we propose\nAutoCodeArena, an automatic Elo rating benchmark designed to assess the coding\nquality of LLMs without human involvement. We find that proprietary LLMs like\nGPT-5, Claude-Sonnet-4, and Claude-Opus-4 still lead in code generation\nperformance among recent emerging models.","upvotes":39,"discussionId":"68ec5e33cd07fb414898c937","projectPage":"https://huggingface.co/spaces/bigcode/arena","githubRepo":"https://github.com/bigcode-project/bigcodearena","githubRepoAddedBy":"user","ai_summary":"BigCodeArena is an open human evaluation platform for code generation that enables real-time execution and interaction, revealing preferences and capabilities of LLMs in coding tasks.","ai_keywords":["Chatbot Arena","BigCodeArena","LLM-generated code","code execution","human evaluation","BigCodeReward","AutoCodeArena","Elo rating benchmark","code understanding","code generation"],"githubStars":58,"organization":{"_id":"62ce8f4248fbe688600093a0","name":"bigcode","fullname":"BigCode","avatar":"https://cdn-uploads.huggingface.co/production/uploads/1659521200179-5e48005437cb5b49818287a5.png"}},"canReadDatabase":false,"canManagePapers":false,"canSubmit":false,"hasHfLevelAccess":false,"upvoted":false,"upvoters":[{"_id":"6039478ab3ecf716b1a5fd4d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/6039478ab3ecf716b1a5fd4d/_Thy4E7taiSYBLKxEKJbT.jpeg","isPro":true,"fullname":"taesiri","user":"taesiri","type":"user"},{"_id":"686831346836397d6622ecbe","avatarUrl":"/avatars/4a52e1765438ffc30781b11a425eadec.svg","isPro":false,"fullname":"Thor Odinson","user":"noooobmaster69","type":"user"},{"_id":"635e3a76106f984574c36409","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1667120725800-635e3a76106f984574c36409.png","isPro":false,"fullname":"Bo 
Liu","user":"Benjamin-eecs","type":"user"},{"_id":"620783f24e28382272337ba4","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/620783f24e28382272337ba4/zkUveQPNiDfYjgGhuFErj.jpeg","isPro":false,"fullname":"GuoLiangTang","user":"Tommy930","type":"user"},{"_id":"60f1abe7544c2adfd699860c","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1674929746905-60f1abe7544c2adfd699860c.jpeg","isPro":false,"fullname":"AK","user":"akhaliq","type":"user"},{"_id":"5feab3a28a3201f8e554c969","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1660795228685-5feab3a28a3201f8e554c969.png","isPro":false,"fullname":"Wenhao Yu","user":"wyu1","type":"user"},{"_id":"656e3808d4de03a07d116850","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/656e3808d4de03a07d116850/62cFw46AmuhdI3gS24F1M.jpeg","isPro":true,"fullname":"Kenneth Hamilton","user":"ZennyKenny","type":"user"},{"_id":"638f1fd8c4444c6ca86ff823","avatarUrl":"/avatars/405807c3868663246cfe371a2034f351.svg","isPro":false,"fullname":"saitejautpala","user":"saitejautpala","type":"user"},{"_id":"642bca844d7a550711e7beac","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/642bca844d7a550711e7beac/qE0btwUskAYeAmuAe82rh.jpeg","isPro":false,"fullname":"JillJia","user":"JillJia","type":"user"},{"_id":"60ecaa5efee13fee7ada7af4","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1637694419284-60ecaa5efee13fee7ada7af4.jpeg","isPro":false,"fullname":"Torsten Scholak","user":"tscholak","type":"user"},{"_id":"644b584a9279988e0cbeb664","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/644b584a9279988e0cbeb664/fhWCI_Q26tTruhdFkjejw.jpeg","isPro":false,"fullname":"Jiawei Liu","user":"ganler","type":"user"},{"_id":"68ed502c3af6e7780c27e225","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/no-auth/5sOHq7dCHekpKQcPN8IUR.png","isPro":false,"fullname":"Hange Liu","user":"hngl","type":"user"}],"acceptLanguages":["*"],"dailyPaperRank":0,"organization":{"_id":"62ce8f4248fbe688600093a0","name":"bigcode","fullname":"BigCode","avatar":"https://cdn-uploads.huggingface.co/production/uploads/1659521200179-5e48005437cb5b49818287a5.png"}}">
AI-generated summary

BigCodeArena is an open human evaluation platform for code generation that enables real-time execution and interaction, revealing preferences and capabilities of LLMs in coding tasks.
Abstract

Crowdsourced model evaluation platforms, such as Chatbot Arena, enable real-time evaluation from human perspectives to assess the quality of model responses. In the coding domain, manually examining the quality of LLM-generated content is extremely challenging, as it requires understanding long chunks of raw code and deliberately simulating code execution. To this end, we introduce BigCodeArena, an open human evaluation platform for code generation backed by a comprehensive and on-the-fly execution environment. Built on top of Chatbot Arena, BigCodeArena enables the execution of LLM-generated code and allows humans to interact with the execution process and outcomes. We collected over 14,000 raw code-centric conversation sessions across 10 widely used LLMs, spanning 10 languages and 8 types of execution environments. Among these conversations, we identified more than 4,700 multi-turn samples with pairwise human preferences. Further analysis uncovers underexplored preferences of LLMs in fine-grained domains characterized by tasks, languages, and frameworks. To systematically examine code understanding and generation capabilities of frontier LLMs, we curated two benchmarks based on the collected data, namely BigCodeReward and AutoCodeArena. For BigCodeReward, we post-processed the 4,700 conversations and evaluated the consistency between reward models and human preferences. The evaluation shows that most LLMs have superior performance in judging coding preferences when the execution results are available. Inspired by these findings, we propose AutoCodeArena, an automatic Elo rating benchmark designed to assess the coding quality of LLMs without human involvement. We find that proprietary LLMs like GPT-5, Claude-Sonnet-4, and Claude-Opus-4 still lead in code generation performance among recent emerging models.
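To make the Elo-style ranking behind an arena benchmark such as AutoCodeArena concrete, the minimal sketch below shows how pairwise win/loss/tie outcomes between models can be folded into Elo ratings. The initial rating, K-factor, function names, and the toy `battles` data are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch of Elo-style rating updates from pairwise model comparisons.
# The rating scale (start at 1000), K-factor, and data format are illustrative
# assumptions, not the AutoCodeArena implementation.

def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

def update_elo(ratings: dict, model_a: str, model_b: str,
               score_a: float, k: float = 32.0) -> None:
    """Update both ratings in place; score_a is 1.0 (A wins), 0.0 (B wins), or 0.5 (tie)."""
    ra = ratings.get(model_a, 1000.0)
    rb = ratings.get(model_b, 1000.0)
    ea = expected_score(ra, rb)
    ratings[model_a] = ra + k * (score_a - ea)
    ratings[model_b] = rb + k * ((1.0 - score_a) - (1.0 - ea))

# Hypothetical pairwise outcomes between anonymized models.
battles = [
    ("model_a", "model_b", 1.0),   # model_a preferred
    ("model_c", "model_a", 0.5),   # tie
    ("model_b", "model_c", 0.0),   # model_c preferred
]

ratings: dict = {}
for a, b, score in battles:
    update_elo(ratings, a, b, score)

# Leaderboard sorted by rating, highest first.
print(sorted(ratings.items(), key=lambda kv: -kv[1]))
```

In practice, arena-style leaderboards often fit ratings from all battles jointly (for example with a Bradley-Terry model) rather than with sequential updates, but the sequential form above is the simplest way to see how each pairwise preference nudges the ranking.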