Evaluating Language Models for Efficient Code Generation
Jiawei Liu, Songrun Xie, Junhao Wang, Yuxiang Wei, Yifeng Ding, Lingming Zhang
arXiv:2408.06450 (published August 12, 2024)
Code: https://github.com/evalplus/evalplus
AI-generated summary
A framework for evaluating the code efficiency of Large Language Models using a compound metric and curated efficiency-demanding tasks.
We introduce Differential Performance Evaluation (DPE), a framework designed
to reliably evaluate Large Language Models (LLMs) for efficient code
generation. Traditional coding benchmarks often fail to provide reliable
insights into code efficiency, due to their reliance on simplistic test inputs
and the absence of effective compound metrics. DPE addresses these issues by
focusing on efficiency-demanding programming tasks and establishing an
insightful compound metric for performance evaluation. DPE operates in two
phases: To curate efficiency datasets, it selects efficiency-demanding tasks
from existing coding benchmarks and generates computationally expensive inputs
to stress the efficiency of LLM solutions. To assess the code efficiency, DPE
profiles the new solution and compares it globally against a set of reference
solutions that exhibit distinct efficiency levels, where the matched level
defines its efficiency score. As a proof of concept, we use DPE to create
EvalPerf, a benchmark of 121 performance-challenging coding tasks. Our
comprehensive evaluation yields interesting findings on the efficiency impact of
model size, instruction tuning, and prompting. For example, while the scaling
law fails to account for code efficiency, general instruction tuning benefits
both code correctness and efficiency. Finally, we evaluate the evaluation
itself, examining the effectiveness of DPE and showing that EvalPerf is
reliable and convenient to use even across platforms.
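
To make the scoring phase concrete, here is a minimal, hypothetical Python sketch of the matching idea described above: a candidate solution is profiled on a computationally expensive input and compared against reference solutions that sit at distinct efficiency levels, and the fraction of levels it matches or beats becomes its score. The names (`profile`, `dpe_score`), the wall-clock timing, and the unweighted scoring are illustrative assumptions, not the actual EvalPerf implementation in evalplus, which uses more robust profiling and a compound, sample-aware metric.

```python
import time
from typing import Callable, List


def profile(solution: Callable, expensive_input) -> float:
    """Measure the cost of one solution on a performance-exercising input.
    Wall-clock timing is a simplification of the paper's profiling setup."""
    start = time.perf_counter()
    solution(expensive_input)
    return time.perf_counter() - start


def dpe_score(candidate: Callable,
              reference_levels: List[Callable],
              expensive_input) -> float:
    """Illustrative differential scoring: the candidate's score is the
    percentage of reference efficiency levels it matches or beats."""
    cand_cost = profile(candidate, expensive_input)
    ref_costs = [profile(ref, expensive_input) for ref in reference_levels]
    matched = sum(cand_cost <= cost for cost in ref_costs)
    return 100.0 * matched / len(ref_costs)


if __name__ == "__main__":
    # Toy task: sum of 0..n-1, with references at two distinct efficiency levels.
    def slow_ref(n):   # linear scan with generator overhead
        return sum(i for i in range(n))

    def fast_ref(n):   # closed-form, constant time
        return n * (n - 1) // 2

    def candidate(n):  # also linear, but via the faster built-in path
        return sum(range(n))

    print(dpe_score(candidate, [slow_ref, fast_ref], expensive_input=2_000_000))
```

In the actual framework, the reference set is curated so that its solutions exhibit verified, distinct performance characteristics; the matched level is therefore meaningful rather than an artifact of measurement noise.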