LongGenBench: Long-context Generation Benchmark

Code: https://github.com/Dominic789654/LongGenBench
```shell
conda create -yn LongGenBench python=3.9
conda activate LongGenBench
pip install -r requirements.txt
```
This is an automated message from the [Librarian Bot](https://huggingface.co/librarian-bots). I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API:

* [LongGenBench: Benchmarking Long-Form Generation in Long Context LLMs](https://huggingface.co/papers/2409.02076) (2024)
* [HELMET: How to Evaluate Long-Context Language Models Effectively and Thoroughly](https://huggingface.co/papers/2410.02694) (2024)
* [Multilingual Evaluation of Long Context Retrieval and Reasoning](https://huggingface.co/papers/2409.18006) (2024)
* [Bridging Context Gaps: Leveraging Coreference Resolution for Long Contextual Understanding](https://huggingface.co/papers/2410.01671) (2024)
* [HelloBench: Evaluating Long Text Generation Capabilities of Large Language Models](https://huggingface.co/papers/2409.16191) (2024)

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any Paper on Hugging Face checkout [this](https://huggingface.co/spaces/librarian-bots/recommend_similar_papers) Space

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: `@librarian-bot recommend`
Abstract
A new benchmark, LongGenBench, evaluates the long-context generation capabilities of large language models, revealing varying performance degradation across different models and model series.
Current long-context benchmarks primarily focus on retrieval-based tests that require Large Language Models (LLMs) to locate specific information within extensive input contexts, such as the needle-in-a-haystack (NIAH) benchmark. Long-context generation, by contrast, refers to a language model's ability to generate coherent and contextually accurate text spanning lengthy passages or documents. While recent studies show strong performance on NIAH and other retrieval-based long-context benchmarks, benchmarks for evaluating long-context generation capabilities are largely lacking. To bridge this gap and offer a comprehensive assessment, we introduce LongGenBench, a synthetic benchmark that allows flexible configuration of customized generation context lengths. LongGenBench advances beyond traditional benchmarks by redesigning the question format and requiring LLMs to respond with a single, cohesive long-context answer. In extensive evaluation using LongGenBench, we observe that: (1) both API-accessed and open-source models exhibit performance degradation in long-context generation scenarios, ranging from 1.2% to 47.1%; (2) different series of LLMs exhibit varying trends of degradation, with Gemini-1.5-Flash showing the least degradation among API-accessed models and the Qwen2 series showing the least degradation among open-source models.
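As a rough illustration of the degradation figures quoted above (1.2% to 47.1%), relative degradation can be read as the percentage drop from a model's baseline score to its score under long-context generation. The helper below is a hypothetical sketch of that arithmetic, not code from the LongGenBench repository; the function name and the example scores are illustrative assumptions.

```python
def relative_degradation(baseline_score: float, longgen_score: float) -> float:
    """Percentage drop from a model's baseline score to its score
    on a long-context generation task (illustrative helper)."""
    if baseline_score <= 0:
        raise ValueError("baseline score must be positive")
    return 100.0 * (baseline_score - longgen_score) / baseline_score

# A model scoring 0.85 on standard prompts but 0.45 on long-context
# generation prompts would degrade by about 47.1%.
print(round(relative_degradation(0.85, 0.45), 1))  # → 47.1
```

Under this reading, the paper's reported range means the strongest models lose only about 1.2% of their baseline performance when forced to produce one long, cohesive answer, while the weakest lose nearly half.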