Paper page - The Bitter Lesson Learned from 2,000+ Multilingual Benchmarks
\n","updatedAt":"2025-04-23T01:44:17.476Z","author":{"_id":"62d4bf8c97ab9eb08762a975","avatarUrl":"/avatars/73c6228e317cf37b4e3c3e7a4b3d8ae8.svg","fullname":"Minghao Wu","name":"minghaowu","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":17,"isUserFollowing":false}},"numEdits":1,"identifiedLanguage":{"language":"en","probability":0.8834763765335083},"editors":["minghaowu"],"editorAvatarUrls":["/avatars/73c6228e317cf37b4e3c3e7a4b3d8ae8.svg"],"reactions":[],"isReport":false}},{"id":"68099594df384ed9e72b8855","author":{"_id":"63d3e0e8ff1384ce6c5dd17d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg","fullname":"Librarian Bot (Bot)","name":"librarian-bot","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":317,"isUserFollowing":false},"createdAt":"2025-04-24T01:36:20.000Z","type":"comment","data":{"edited":false,"hidden":false,"latest":{"raw":"This is an automated message from the [Librarian Bot](https://huggingface.co/librarian-bots). I found the following papers similar to this paper. \n\nThe following papers were recommended by the Semantic Scholar API \n\n* [Kaleidoscope: In-language Exams for Massively Multilingual Vision Evaluation](https://huggingface.co/papers/2504.07072) (2025)\n* [M-Prometheus: A Suite of Open Multilingual LLM Judges](https://huggingface.co/papers/2504.04953) (2025)\n* [GlotEval: A Test Suite for Massively Multilingual Evaluation of Large Language Models](https://huggingface.co/papers/2504.04155) (2025)\n* [MMLU-ProX: A Multilingual Benchmark for Advanced Large Language Model Evaluation](https://huggingface.co/papers/2503.10497) (2025)\n* [WebFAQ: A Multilingual Collection of Natural Q&A Datasets for Dense Retrieval](https://huggingface.co/papers/2502.20936) (2025)\n* [Babel: Open Multilingual Large Language Models Serving Over 90% of Global Speakers](https://huggingface.co/papers/2503.00865) (2025)\n* [XIFBench: Evaluating Large Language Models on Multilingual Instruction Following](https://huggingface.co/papers/2503.07539) (2025)\n\n\n Please give a thumbs up to this comment if you found it helpful!\n\n If you want recommendations for any Paper on Hugging Face checkout [this](https://huggingface.co/spaces/librarian-bots/recommend_similar_papers) Space\n\n You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: `@librarian-bot recommend`","html":"
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
\n
The following papers were recommended by the Semantic Scholar API
Please give a thumbs up to this comment if you found it helpful!
\n
If you want recommendations for any Paper on Hugging Face checkout this Space
\n
You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: \n\n@librarian-bot\n\t recommend
\n","updatedAt":"2025-04-24T01:36:20.035Z","author":{"_id":"63d3e0e8ff1384ce6c5dd17d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg","fullname":"Librarian Bot (Bot)","name":"librarian-bot","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":317,"isUserFollowing":false}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.669543981552124},"editors":["librarian-bot"],"editorAvatarUrls":["https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg"],"reactions":[],"isReport":false}}],"primaryEmailConfirmed":false,"paper":{"id":"2504.15521","authors":[{"_id":"6808458f07e80b69b2df2440","user":{"_id":"62d4bf8c97ab9eb08762a975","avatarUrl":"/avatars/73c6228e317cf37b4e3c3e7a4b3d8ae8.svg","isPro":false,"fullname":"Minghao Wu","user":"minghaowu","type":"user"},"name":"Minghao Wu","status":"admin_assigned","statusLastChangedAt":"2025-04-23T13:03:02.862Z","hidden":false},{"_id":"6808458f07e80b69b2df2441","user":{"_id":"675c52ecdea6ceb2f6ce0ea3","avatarUrl":"/avatars/b5ea0e2bc5b001446f62343a907e95f1.svg","isPro":false,"fullname":"weixuan wang","user":"yourdadishere","type":"user"},"name":"Weixuan Wang","status":"admin_assigned","statusLastChangedAt":"2025-04-23T13:03:08.988Z","hidden":false},{"_id":"6808458f07e80b69b2df2442","user":{"_id":"635504620d6e89270d440050","avatarUrl":"/avatars/3790bf4a68f943a122af59b1362b07f2.svg","isPro":false,"fullname":"LiuSinuo","user":"SNF","type":"user"},"name":"Sinuo Liu","status":"admin_assigned","statusLastChangedAt":"2025-04-23T13:03:19.254Z","hidden":false},{"_id":"6808458f07e80b69b2df2443","name":"Huifeng Yin","hidden":false},{"_id":"6808458f07e80b69b2df2444","user":{"_id":"674879885d97f2d66a14006a","avatarUrl":"/avatars/cdeff44a898560cb460223f30922b081.svg","isPro":false,"fullname":"Xintong Wang","user":"shanewang","type":"user"},"name":"Xintong Wang","status":"claimed_verified","statusLastChangedAt":"2025-05-12T06:51:30.134Z","hidden":false},{"_id":"6808458f07e80b69b2df2445","name":"Yu Zhao","hidden":false},{"_id":"6808458f07e80b69b2df2446","user":{"_id":"6527d8b077bceabaab382a75","avatarUrl":"/avatars/69caacf9153dbf6a3796693a968b363f.svg","isPro":false,"fullname":"Chenyang Lyu","user":"ChenyangLyu","type":"user"},"name":"Chenyang Lyu","status":"claimed_verified","statusLastChangedAt":"2025-04-23T08:28:16.770Z","hidden":false},{"_id":"6808458f07e80b69b2df2447","user":{"_id":"636b030c328133bdb3a523bc","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/636b030c328133bdb3a523bc/f-OdbqqHiywkxQF1KVCLp.jpeg","isPro":false,"fullname":"Longyue Wang","user":"longyuewang","type":"user"},"name":"Longyue Wang","status":"admin_assigned","statusLastChangedAt":"2025-04-23T13:04:14.454Z","hidden":false},{"_id":"6808458f07e80b69b2df2448","user":{"_id":"66b03cedd59c09785e39711e","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/no-auth/N5yfQBSP3oAKPCz4ylR09.png","isPro":false,"fullname":"Weihua Luo","user":"acecamel1977","type":"user"},"name":"Weihua Luo","status":"admin_assigned","statusLastChangedAt":"2025-04-23T13:04:20.964Z","hidden":false},{"_id":"6808458f07e80b69b2df2449","user":{"_id":"63f87ebadf053017d1acbfdd","avatarUrl":"/avatars/e497ba5f41a2587837b4a6118d9367bb.svg","isPro":false,"fullname":"Kaifu Zhang","user":"zhangkaifu314","type":"user"},"name":"Kaifu 
Zhang","status":"admin_assigned","statusLastChangedAt":"2025-04-23T13:04:27.022Z","hidden":false}],"publishedAt":"2025-04-22T01:47:37.000Z","submittedOnDailyAt":"2025-04-23T00:13:52.385Z","title":"The Bitter Lesson Learned from 2,000+ Multilingual Benchmarks","submittedOnDailyBy":{"_id":"62d4bf8c97ab9eb08762a975","avatarUrl":"/avatars/73c6228e317cf37b4e3c3e7a4b3d8ae8.svg","isPro":false,"fullname":"Minghao Wu","user":"minghaowu","type":"user"},"summary":"As large language models (LLMs) continue to advance in linguistic\ncapabilities, robust multilingual evaluation has become essential for promoting\nequitable technological progress. This position paper examines over 2,000\nmultilingual (non-English) benchmarks from 148 countries, published between\n2021 and 2024, to evaluate past, present, and future practices in multilingual\nbenchmarking. Our findings reveal that, despite significant investments\namounting to tens of millions of dollars, English remains significantly\noverrepresented in these benchmarks. Additionally, most benchmarks rely on\noriginal language content rather than translations, with the majority sourced\nfrom high-resource countries such as China, India, Germany, the UK, and the\nUSA. Furthermore, a comparison of benchmark performance with human judgments\nhighlights notable disparities. STEM-related tasks exhibit strong correlations\nwith human evaluations (0.70 to 0.85), while traditional NLP tasks like\nquestion answering (e.g., XQuAD) show much weaker correlations (0.11 to 0.30).\nMoreover, translating English benchmarks into other languages proves\ninsufficient, as localized benchmarks demonstrate significantly higher\nalignment with local human judgments (0.68) than their translated counterparts\n(0.47). This underscores the importance of creating culturally and\nlinguistically tailored benchmarks rather than relying solely on translations.\nThrough this comprehensive analysis, we highlight six key limitations in\ncurrent multilingual evaluation practices, propose the guiding principles\naccordingly for effective multilingual benchmarking, and outline five critical\nresearch directions to drive progress in the field. 
Finally, we call for a\nglobal collaborative effort to develop human-aligned benchmarks that prioritize\nreal-world applications.","upvotes":64,"discussionId":"6808459007e80b69b2df249e","ai_summary":"Research reveals significant disparities in multilingual benchmark evaluations, emphasizing the need for culturally and linguistically tailored benchmarks over translations to achieve equitable technological progress.","ai_keywords":[""]},"canReadDatabase":false,"canManagePapers":false,"canSubmit":false,"hasHfLevelAccess":false,"upvoted":false,"upvoters":[{"_id":"66bf01fed7a9770138967d7f","avatarUrl":"/avatars/1d9dd0ee1f383aeb1b7c5c903666a556.svg","isPro":false,"fullname":"TonySilva","user":"TonySilva423","type":"user"},{"_id":"637b5af1c048d163679f67f6","avatarUrl":"/avatars/865b2fc050cd44038ee48fb291845076.svg","isPro":false,"fullname":"GuoFeng Project","user":"guofeng-project","type":"user"},{"_id":"6742deb4d3ad4510c12da658","avatarUrl":"/avatars/91407d854560ef9a2facd80fa8fab6ec.svg","isPro":false,"fullname":"Kechen Li","user":"Kechen-Li","type":"user"},{"_id":"636b030c328133bdb3a523bc","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/636b030c328133bdb3a523bc/f-OdbqqHiywkxQF1KVCLp.jpeg","isPro":false,"fullname":"Longyue Wang","user":"longyuewang","type":"user"},{"_id":"62d4bf8c97ab9eb08762a975","avatarUrl":"/avatars/73c6228e317cf37b4e3c3e7a4b3d8ae8.svg","isPro":false,"fullname":"Minghao Wu","user":"minghaowu","type":"user"},{"_id":"642656cbad1e3b0e6e91b752","avatarUrl":"/avatars/3bf0ee15fd528e09b2b889f5cce3cbd0.svg","isPro":false,"fullname":"Jie Zhu","user":"amazingj","type":"user"},{"_id":"67c8af4885b7a86d7515b77c","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/no-auth/4bRu0MKg6voGX5fyqninn.png","isPro":false,"fullname":"Lorenzo Xiao","user":"lrzneedresearch","type":"user"},{"_id":"64fa937114636d417a87e2ff","avatarUrl":"/avatars/3ca99b55bc920cf868657ec947e86a3f.svg","isPro":false,"fullname":"Haolan Zhan","user":"zhanhaolan","type":"user"},{"_id":"64a3d40815655921915b8ce2","avatarUrl":"/avatars/6b6b550d96be4a6473e2ccf74df438f7.svg","isPro":false,"fullname":"Jianhuipang","user":"pangjh3","type":"user"},{"_id":"5fb2d92a9f63b546e74cb399","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1653523921526-5fb2d92a9f63b546e74cb399.png","isPro":false,"fullname":"chiyu_zhang","user":"chiyuzhang","type":"user"},{"_id":"638439ca834d3558a398d035","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1669609868550-noauth.png","isPro":false,"fullname":"Zhiwei He","user":"zwhe99","type":"user"},{"_id":"63525f2156ef05f3a1f52362","avatarUrl":"/avatars/0748e51ff76d044dc425044e208b8342.svg","isPro":false,"fullname":"Wenxuan Wang","user":"JarvisWang","type":"user"}],"acceptLanguages":["*"],"dailyPaperRank":0}">
AI-generated summary
Research reveals significant disparities in multilingual benchmark evaluations, emphasizing the need for culturally and linguistically tailored benchmarks over translations to achieve equitable technological progress.

Abstract
As large language models (LLMs) continue to advance in linguistic
capabilities, robust multilingual evaluation has become essential for promoting
equitable technological progress. This position paper examines over 2,000
multilingual (non-English) benchmarks from 148 countries, published between
2021 and 2024, to evaluate past, present, and future practices in multilingual
benchmarking. Our findings reveal that, despite significant investments
amounting to tens of millions of dollars, English remains significantly
overrepresented in these benchmarks. Additionally, most benchmarks rely on
original language content rather than translations, with the majority sourced
from high-resource countries such as China, India, Germany, the UK, and the
USA. Furthermore, a comparison of benchmark performance with human judgments
highlights notable disparities. STEM-related tasks exhibit strong correlations
with human evaluations (0.70 to 0.85), while traditional NLP tasks like
question answering (e.g., XQuAD) show much weaker correlations (0.11 to 0.30).
Moreover, translating English benchmarks into other languages proves
insufficient, as localized benchmarks demonstrate significantly higher
alignment with local human judgments (0.68) than their translated counterparts
(0.47). This underscores the importance of creating culturally and
linguistically tailored benchmarks rather than relying solely on translations.
Through this comprehensive analysis, we highlight six key limitations in
current multilingual evaluation practices, propose corresponding guiding
principles for effective multilingual benchmarking, and outline five critical
research directions to drive progress in the field. Finally, we call for a
global collaborative effort to develop human-aligned benchmarks that prioritize
real-world applications.
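
The abstract's central quantitative claims are correlations between benchmark scores and human judgments (0.70 to 0.85 for STEM tasks, 0.11 to 0.30 for XQuAD-style QA, and 0.68 vs. 0.47 for localized vs. translated benchmarks). The sketch below shows one way such a correlation could be computed; the score values and the choice between Pearson and Spearman are illustrative assumptions, not the paper's actual data or method.

```python
# Minimal sketch of correlating benchmark scores with human judgments.
# All values below are hypothetical; the paper's actual data, models,
# and correlation variant are not reproduced here.
from scipy.stats import pearsonr, spearmanr

# Hypothetical per-model accuracy on a multilingual benchmark
# (e.g., an XQuAD-style QA set), one entry per evaluated model.
benchmark_scores = [0.62, 0.71, 0.55, 0.80, 0.67]

# Hypothetical human judgment scores for the same five models.
human_judgments = [0.58, 0.74, 0.49, 0.77, 0.70]

r, p = pearsonr(benchmark_scores, human_judgments)          # linear correlation
rho, p_rank = spearmanr(benchmark_scores, human_judgments)  # rank correlation

print(f"Pearson r = {r:.2f} (p = {p:.3f})")
print(f"Spearman rho = {rho:.2f} (p = {p_rank:.3f})")
```

Under this reading, a reported value such as 0.68 for a localized benchmark vs. 0.47 for its translated counterpart would be this statistic computed over a set of models against local human judgments.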