LongGenBench: Long-context Generation Benchmark

Code: https://github.com/Dominic789654/LongGenBench
```shell
conda create -yn LongGenBench python=3.9
conda activate LongGenBench
pip install -r requirements.txt
```
This is an automated message from the [Librarian Bot](https://huggingface.co/librarian-bots). I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API:

* [LongGenBench: Benchmarking Long-Form Generation in Long Context LLMs](https://huggingface.co/papers/2409.02076) (2024)
* [HELMET: How to Evaluate Long-Context Language Models Effectively and Thoroughly](https://huggingface.co/papers/2410.02694) (2024)
* [Multilingual Evaluation of Long Context Retrieval and Reasoning](https://huggingface.co/papers/2409.18006) (2024)
* [Bridging Context Gaps: Leveraging Coreference Resolution for Long Contextual Understanding](https://huggingface.co/papers/2410.01671) (2024)
* [HelloBench: Evaluating Long Text Generation Capabilities of Large Language Models](https://huggingface.co/papers/2409.16191) (2024)

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any Paper on Hugging Face checkout [this](https://huggingface.co/spaces/librarian-bots/recommend_similar_papers) Space

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: `@librarian-bot recommend`
Abstract
A new benchmark, LongGenBench, evaluates the long-context generation capabilities of large language models, revealing varying performance degradation across different models and model series.
Current long-context benchmarks primarily focus on retrieval-based tests that require Large Language Models (LLMs) to locate specific information within extensive input contexts, such as the needle-in-a-haystack (NIAH) benchmark. Long-context generation, by contrast, refers to a language model's ability to generate coherent and contextually accurate text spanning lengthy passages or documents. While recent studies show strong performance on NIAH and other retrieval-based long-context benchmarks, benchmarks for evaluating long-context generation capabilities are largely lacking. To bridge this gap and offer a comprehensive assessment, we introduce LongGenBench, a synthetic benchmark that allows flexible configuration of customized generation context lengths. LongGenBench advances beyond traditional benchmarks by redesigning the question format and requiring LLMs to respond with a single, cohesive long-context answer. In extensive evaluation using LongGenBench, we observe that: (1) both API-accessed and open-source models exhibit performance degradation in long-context generation scenarios, ranging from 1.2% to 47.1%; (2) different series of LLMs exhibit varying trends of degradation, with Gemini-1.5-Flash showing the least degradation among API-accessed models and the Qwen2 series showing the least degradation among open-source models.
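As a rough illustration of the degradation figures quoted above (1.2% to 47.1%), relative degradation can be read as the percentage drop from a model's baseline score to its score under long-context generation. The helper below is a hypothetical sketch of that arithmetic, not code from the LongGenBench repository; the function name and the example scores are illustrative assumptions.

```python
def relative_degradation(baseline_score: float, longgen_score: float) -> float:
    """Percentage drop from a model's baseline score to its score
    on a long-context generation task (illustrative helper)."""
    if baseline_score <= 0:
        raise ValueError("baseline score must be positive")
    return 100.0 * (baseline_score - longgen_score) / baseline_score

# A model scoring 0.85 on standard prompts but 0.45 on long-context
# generation prompts would degrade by about 47.1%.
print(round(relative_degradation(0.85, 0.45), 1))  # → 47.1
```

Under this reading, the paper's reported range means the strongest models lose only about 1.2% of their baseline performance when forced to produce one long, cohesive answer, while the weakest lose nearly half.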