arxiv:2504.01833

YourBench: Easy Custom Evaluation Sets for Everyone

Published on Apr 2, 2025 · Submitted by Sumuk Shashidhar on Apr 2, 2025

AI-generated summary

YourBench introduces an automated framework for generating domain-specific LLM benchmarks dynamically and cost-effectively from user documents, ensuring reliable and up-to-date evaluations.

Abstract

Evaluating large language models (LLMs) effectively remains a critical bottleneck, as traditional static benchmarks suffer from saturation and contamination, while human evaluations are costly and slow. This hinders timely or domain-specific assessment, crucial for real-world applications. We introduce YourBench, a novel, open-source framework that addresses these limitations by enabling dynamic, automated generation of reliable, up-to-date, and domain-tailored benchmarks cheaply and without manual annotation, directly from user-provided documents. We demonstrate its efficacy by replicating 7 diverse MMLU subsets using minimal source text, achieving this for under 15 USD in total inference costs while perfectly preserving the relative model performance rankings (Spearman Rho = 1) observed on the original benchmark. To ensure that YourBench generates data grounded in provided input instead of relying on posterior parametric knowledge in models, we also introduce Tempora-0325, a novel dataset of over 7K diverse documents, published exclusively after March 2025. Our comprehensive analysis spans 26 SoTA models from 7 major families across varying scales (3-671B parameters) to validate the quality of generated evaluations through rigorous algorithmic checks (e.g., citation grounding) and human assessments. We release the YourBench library, the Tempora-0325 dataset, 150k+ question answer pairs based on Tempora and all evaluation and inference traces to facilitate reproducible research and empower the community to generate bespoke benchmarks on demand, fostering more relevant and trustworthy LLM evaluation.
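The ranking-preservation claim is easy to make concrete: it amounts to a Spearman rank correlation between model scores on the original benchmark and on the generated replica. A minimal sketch, using purely hypothetical scores in place of real evaluation results:

```python
from scipy.stats import spearmanr

# Hypothetical accuracy scores for the same models on an original MMLU
# subset and on a YourBench-style generated replica of that subset.
original = {"model-a": 0.82, "model-b": 0.74, "model-c": 0.61, "model-d": 0.55}
replica = {"model-a": 0.79, "model-b": 0.70, "model-c": 0.58, "model-d": 0.49}

models = sorted(original)  # fixed ordering so the two score lists align
rho, p_value = spearmanr(
    [original[m] for m in models],
    [replica[m] for m in models],
)

# rho == 1.0 means the replica ranks the models exactly as the original does,
# even though the absolute scores differ.
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
```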

Community

Sumuk Shashidhar (paper author and submitter)

we're launching 🤗 yourbench today, an open source tool for custom benchmarking and synthetic data generation from ANY of your documents. it's a big step towards improving how model evaluations work

most benchmarks test general capabilities, but we know that for many use cases what really matters is how well a model performs your specific task. yourbench lets you evaluate models on what matters to you.

you can try it with your own docs today: https://huggingface.co/spaces/yourbench/demo
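For a sense of what benchmarking from your own documents involves, here is a rough illustrative sketch of the general recipe the paper describes: split a document into chunks, then prompt an LLM for questions answerable only from each chunk. The client, model name, file name, and helper functions below are placeholders for illustration, not the actual yourbench API:

```python
# Illustrative sketch only, NOT the actual yourbench API: chunk a document,
# then ask an LLM to write questions answerable only from each chunk.
from openai import OpenAI  # stand-in client; any chat-capable endpoint works

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def chunk(text: str, size: int = 2000) -> list[str]:
    """Naive fixed-size chunking; real pipelines use smarter splitting."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def questions_from_chunk(passage: str, n: int = 3) -> str:
    prompt = (
        f"Write {n} exam-style questions that can be answered ONLY from the "
        "passage below. Give each question with its answer and a supporting "
        f"quote.\n\nPassage:\n{passage}"
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # hypothetical model choice
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content


document = open("my_internal_report.txt").read()  # hypothetical input file
qa_blocks = [questions_from_chunk(c) for c in chunk(document)]
```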

Salma Mayorquin (salma-remyx) replied:

Congrats on the feature release @sumuks

But data generation and judges for task-specific skills assessment have been around for at least a year, yet people are still struggling to choose the best foundation model for their application.

That's because offline evaluation techniques like this are unreliable predictors of how users will respond to changes in the app. It's a nuanced engineering decision that weighs factors such as latency and other real-world constraints on deploying the model.

In fact, most "great ideas" for software changes across the industry fail to meaningfully impact business metrics and user engagement when measured with online evaluation methods like A/B testing.
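(As a concrete illustration of that kind of online check, here is a minimal sketch of a two-proportion z-test on an engagement metric between the current and candidate model; the counts are hypothetical and the choice of test is just one common option.)

```python
# Minimal sketch of an online A/B check: a two-proportion z-test comparing an
# engagement metric between control (current model) and treatment (candidate).
# All counts are hypothetical.
from statsmodels.stats.proportion import proportions_ztest

control_conversions, control_users = 480, 10_000      # current model
treatment_conversions, treatment_users = 505, 10_000  # candidate model

stat, p_value = proportions_ztest(
    count=[treatment_conversions, control_conversions],
    nobs=[treatment_users, control_users],
)
print(f"z = {stat:.2f}, p = {p_value:.3f}")
# Ship only if the lift is practically meaningful and statistically
# significant, not just better on an offline benchmark.
```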

Tools like this may help in the ramp-up to the real evaluation, but only if you're applying the scientific method to identify those predictors and close the loop in your AI development and evaluation.

I wrote more on trustworthy AI experiments here: https://www.remyx.ai/blog/trustworthy-ai-experiments

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API:

* WritingBench: A Comprehensive Benchmark for Generative Writing (2025) - https://huggingface.co/papers/2503.05244
* Dynamic-KGQA: A Scalable Framework for Generating Adaptive Question Answering Datasets (2025) - https://huggingface.co/papers/2503.05049
* Zero-shot Benchmarking: A Framework for Flexible and Scalable Automatic Evaluation of Language Models (2025) - https://huggingface.co/papers/2504.01001
* Investigating the Impact of Quantization Methods on the Safety and Reliability of Large Language Models (2025) - https://huggingface.co/papers/2502.15799
* LaRA: Benchmarking Retrieval-Augmented Generation and Long-Context LLMs - No Silver Bullet for LC or RAG Routing (2025) - https://huggingface.co/papers/2502.09977
* Adaptively evaluating models with task elicitation (2025) - https://huggingface.co/papers/2503.01986
* TLUE: A Tibetan Language Understanding Evaluation Benchmark (2025) - https://huggingface.co/papers/2503.12051

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any paper on Hugging Face, check out this Space: https://huggingface.co/spaces/librarian-bots/recommend_similar_papers

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend


Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2504.01833 in a model README.md to link it from this page.

Datasets citing this paper 1

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2504.01833 in a Space README.md to link it from this page.

Collections including this paper 6