arxiv:2409.16191

HelloBench: Evaluating Long Text Generation Capabilities of Large Language Models

Published on Sep 24, 2024 · Submitted by Songyang Zhang on Sep 25, 2024

#1 Paper of the day

Authors: Haoran Que, Feiyu Duan, Liqun He, Yutao Mou, Wangchunshu Zhou, Jiaheng Liu, Wenge Rong, Zekun Moore Wang, Jian Yang, Ge Zhang, Junran Peng, Zhaoxiang Zhang, Songyang Zhang, Kai Chen

AI-generated summary

A comprehensive benchmark, HelloBench, and evaluation method, HelloEval, are introduced to assess the long text generation capabilities of Large Language Models, revealing limitations and providing human-aligned evaluations.

Abstract

In recent years, Large Language Models (LLMs) have demonstrated remarkable capabilities in various tasks (e.g., long-context understanding), and many benchmarks have been proposed. However, we observe that long text generation capabilities are not well investigated. Therefore, we introduce the Hierarchical Long Text Generation Benchmark (HelloBench), a comprehensive, in-the-wild, and open-ended benchmark to evaluate LLMs' performance in generating long text. Based on Bloom's Taxonomy, HelloBench categorizes long text generation tasks into five subtasks: open-ended QA, summarization, chat, text completion, and heuristic text generation. In addition, we propose Hierarchical Long Text Evaluation (HelloEval), a human-aligned evaluation method that significantly reduces the time and effort required for human evaluation while maintaining a high correlation with human evaluation. We have conducted extensive experiments across around 30 mainstream LLMs and observed that current LLMs lack long text generation capabilities. Specifically, first, regardless of whether the instructions include explicit or implicit length constraints, we observe that most LLMs cannot generate text that is longer than 4000 words. Second, we observe that while some LLMs can generate longer text, many issues exist (e.g., severe repetition and quality degradation). Third, to demonstrate the effectiveness of HelloEval, we compare HelloEval with traditional metrics (e.g., ROUGE and BLEU) and LLM-as-a-Judge methods, and the comparison shows that HelloEval has the highest correlation with human evaluation. We release our code at https://github.com/Quehry/HelloBench.
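
The abstract ranks evaluation methods by how well their scores correlate with human evaluation, but it does not say which correlation coefficient is used. As a minimal sketch, assuming a Spearman rank correlation (an assumption, not something the abstract confirms), the comparison could look roughly like the following. The function names `average_ranks`, `pearson`, and `spearman` are hypothetical helpers written for this illustration and are not taken from the HelloBench repository.

```python
import math
from typing import Sequence


def average_ranks(values: Sequence[float]) -> list[float]:
    """Return 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def spearman(metric_scores: Sequence[float], human_scores: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation computed on the ranks."""
    if len(metric_scores) != len(human_scores) or len(metric_scores) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(average_ranks(metric_scores), average_ranks(human_scores))
```

Under this reading, each candidate metric would be scored on the same set of generated texts as the human annotators, and the metric whose scores yield the largest `spearman(...)` value against the human scores would count as the most human-aligned. Other coefficients such as Pearson or Kendall's tau could be substituted without changing the overall procedure.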