arxiv:2401.03855

PythonSaga: Redefining the Benchmark to Evaluate Code Generating LLMs

Published on Jan 8, 2024

Abstract

A novel benchmark, PythonSaga, addresses biases in existing Python code generation benchmarks by featuring diverse programming concepts and difficulty levels.

Driven by the surge in code generation using large language models (LLMs), numerous benchmarks have emerged to evaluate these LLMs' capabilities. We conducted a large-scale human evaluation of HumanEval and MBPP, two popular benchmarks for Python code generation, analyzing their diversity and difficulty. Our findings unveil a critical bias towards a limited set of programming concepts, neglecting most of the other concepts entirely. Furthermore, we uncover a worrying prevalence of easy tasks, potentially inflating model performance estimations. To address these limitations, we propose a novel benchmark, PythonSaga, featuring 185 hand-crafted prompts on a balanced representation of 38 programming concepts across diverse difficulty levels. The robustness of our benchmark is demonstrated by the poor performance of existing Code-LLMs.


