OpenEvals
  • evaluation guidebook, your reference for LLM evals
  • lighteval LLM evaluation suite, fast and filled with the SOTA benchmarks you might want
  • leaderboards on the hub initiative, to encourage people to build more leaderboards in the open for more reproducible evaluation. You'll find some doc here to build your own, and you can look for the best leaderboard for your use case here!

Our archived projects:

We're not behind the evaluate metrics guide, but if you want to understand metrics better, we really recommend checking it out!

    \n","classNames":"hf-sanitized hf-sanitized-pK58KlhRYmpcUXDKEpQ90"},"users":[{"_id":"6202a599216215a22221dea9","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1644340617257-noauth.png","isPro":false,"fullname":"ClΓ©mentine Fourrier","user":"clefourrier","type":"user"},{"_id":"5df7e9e5da6d0311fd3d53f9","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1583857746553-5df7e9e5da6d0311fd3d53f9.jpeg","isPro":true,"fullname":"Thomas Wolf","user":"thomwolf","type":"user"},{"_id":"63e0eea7af523c37e5a77966","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1678663263366-63e0eea7af523c37e5a77966.jpeg","isPro":true,"fullname":"Nathan Habib","user":"SaylorTwift","type":"user"},{"_id":"63f5010dfcf95ecac2ad8652","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/63f5010dfcf95ecac2ad8652/vmRox4fcHMjT1y2bidjOL.jpeg","isPro":false,"fullname":"Alina Lozovskaya","user":"alozowski","type":"user"},{"_id":"64cb7fdb9e30a46f7b92aa45","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/64cb7fdb9e30a46f7b92aa45/TKaRtn_-R_W__QY8DsQv3.jpeg","isPro":true,"fullname":"frere thibaud","user":"tfrere","type":"user"},{"_id":"680ff4388f704be391757780","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/no-auth/WEjkuS_TxIgtYRxNPa0VS.png","isPro":false,"fullname":"Georgia Channing","user":"cgeorgiaw","type":"user"},{"_id":"5e48005437cb5b49818287a5","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/5e48005437cb5b49818287a5/4uCXGGui-9QifAT4qelxU.png","isPro":false,"fullname":"Leandro von Werra","user":"lvwerra","type":"user"},{"_id":"61c141342aac764ce1654e43","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/61c141342aac764ce1654e43/81AwoT5IQ_Xdw0OVw7TKu.jpeg","isPro":false,"fullname":"Loubna Ben 
Allal","user":"loubnabnl","type":"user"}],"userCount":8,"collections":[{"slug":"OpenEvals/research-collaborations-67c1824c775d5aed68071aa2","title":"Research collaborations","description":"A small overview of our research collabs through the years","gating":false,"lastUpdated":"2025-10-07T09:49:28.613Z","owner":{"_id":"67bed722e00731308d6a506d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/6202a599216215a22221dea9/LHg8qch52zd2GAxCQvPoq.png","fullname":"OpenEvals","name":"OpenEvals","type":"org","isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":153,"isUserFollowing":false},"items":[{"_id":"67c1828264f4f5480b63ae82","position":0,"type":"paper","id":"2311.12983","title":"GAIA: a benchmark for General AI Assistants","thumbnailUrl":"https://cdn-thumbnails.huggingface.co/social-thumbnails/papers/2311.12983.png","upvotes":244,"publishedAt":"2023-11-21T20:34:47.000Z","isUpvotedByUser":false},{"_id":"67c1836c2ab7556560679d4e","position":1,"type":"paper","id":"2310.16944","title":"Zephyr: Direct Distillation of LM Alignment","thumbnailUrl":"https://cdn-thumbnails.huggingface.co/social-thumbnails/papers/2310.16944.png","upvotes":123,"publishedAt":"2023-10-25T19:25:16.000Z","isUpvotedByUser":false},{"_id":"67c183614c84417091795839","position":2,"type":"paper","id":"2502.02737","title":"SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model","thumbnailUrl":"https://cdn-thumbnails.huggingface.co/social-thumbnails/papers/2502.02737.png","upvotes":254,"publishedAt":"2025-02-04T21:43:16.000Z","isUpvotedByUser":false},{"_id":"67c1827c5b49b516415e9f42","position":3,"type":"paper","id":"2412.03304","title":"Global MMLU: Understanding and Addressing Cultural and Linguistic Biases\n in Multilingual 
Evaluation","thumbnailUrl":"https://cdn-thumbnails.huggingface.co/social-thumbnails/papers/2412.03304.png","upvotes":19,"publishedAt":"2024-12-04T13:27:09.000Z","isUpvotedByUser":false}],"position":0,"theme":"orange","private":false,"shareUrl":"https://hf.co/collections/OpenEvals/research-collaborations","upvotes":1,"isUpvotedByUser":false},{"slug":"OpenEvals/making-evals-easy-67c1897cc80fbe12dcb2cbd0","title":"Making evals easy","description":"","gating":false,"lastUpdated":"2025-10-07T11:15:40.230Z","owner":{"_id":"67bed722e00731308d6a506d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/6202a599216215a22221dea9/LHg8qch52zd2GAxCQvPoq.png","fullname":"OpenEvals","name":"OpenEvals","type":"org","isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":153,"isUserFollowing":false},"items":[{"_id":"67dadd41b5ed8b66e2d7ae51","position":0,"type":"space","author":"OpenEvals","authorData":{"_id":"67bed722e00731308d6a506d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/6202a599216215a22221dea9/LHg8qch52zd2GAxCQvPoq.png","fullname":"OpenEvals","name":"OpenEvals","type":"org","isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":153,"isUserFollowing":false},"colorFrom":"indigo","colorTo":"purple","createdAt":"2025-01-16T10:37:58.000Z","emoji":"πŸ”","id":"OpenEvals/find-a-leaderboard","lastModified":"2025-03-04T08:29:15.000Z","likes":135,"pinned":true,"private":false,"sdk":"docker","repoType":"space","runtime":{"stage":"RUNNING","hardware":{"current":"cpu-basic","requested":"cpu-basic"},"storage":null,"gcTimeout":172800,"replicas":{"current":1,"requested":1},"devMode":false,"domains":[{"domain":"leaderboard-explorer-leaderboard-explorer.hf.space","stage":"READY"},{"domain":"openevals-find-a-leaderboard.hf.space","stage":"READY"}],"sha":"0dfbcb9e1a01dd0bf3995e8e02ac7d9335faf2bf"},"shortDescription":"Explore and discover all leaderboards from the HF community","title":"Find a 
leaderboard","isLikedByUser":false,"ai_short_description":"Display leaderboards with dark mode support","ai_category":"Data Visualization","trendingScore":3,"tags":["docker","explorer","region:us"],"featured":false},{"_id":"68e4e239032a780816501931","position":1,"type":"space","author":"yourbench","authorData":{"_id":"678905a9cd3f9fe60098e689","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/63f5010dfcf95ecac2ad8652/jda1zapqdQU6Og-YjH2ta.png","fullname":"Your Bench","name":"yourbench","type":"org","isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":71,"isUserFollowing":false},"colorFrom":"yellow","colorTo":"gray","createdAt":"2025-03-13T21:09:54.000Z","emoji":"πŸš€","id":"yourbench/advanced","lastModified":"2025-10-07T09:53:54.000Z","likes":44,"pinned":false,"private":false,"sdk":"docker","repoType":"space","runtime":{"stage":"RUNNING","hardware":{"current":"cpu-upgrade","requested":"cpu-upgrade"},"storage":"small","gcTimeout":172800,"replicas":{"current":1,"requested":1},"devMode":false,"domains":[{"domain":"yourbench-yourbench-space.hf.space","stage":"READY"},{"domain":"yourbench-yourbench-demo.hf.space","stage":"READY"},{"domain":"yourbench-advanced.hf.space","stage":"READY"}],"sha":"ed40d3a0a367926bbc111a8d86b18eec73605a48"},"shortDescription":"Generate custom evaluations from your data easily!","title":"YourBench","isLikedByUser":false,"ai_short_description":"Create dynamic benchmarks from documents","ai_category":"Document Analysis","trendingScore":0,"tags":["docker","region:us"],"featured":false},{"_id":"67c189a5c5952c32b920a8ae","position":2,"type":"space","note":{"html":"An example leaderboard you can fork if you need\n","text":"An example leaderboard you can fork if you need\n"},"author":"demo-leaderboard-backend","authorData":{"_id":"655dbd8360009b03e4451217","avatarUrl":"https://www.gravatar.com/avatar/48236a8e5b71950f0708b3f2e3e7925f?d=retro&size=100","fullname":"Demo leaderboard with an integrated 
backend","name":"demo-leaderboard-backend","type":"org","isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":22,"isUserFollowing":false},"colorFrom":"green","colorTo":"indigo","createdAt":"2023-11-22T08:41:23.000Z","emoji":"πŸ₯‡","id":"demo-leaderboard-backend/leaderboard","lastModified":"2025-08-21T10:42:52.000Z","likes":16,"pinned":true,"private":false,"sdk":"gradio","repoType":"space","runtime":{"stage":"RUNTIME_ERROR","hardware":{"current":null,"requested":"cpu-upgrade"},"storage":null,"gcTimeout":null,"errorMessage":"Exit code: 1. Reason: r_fn\n return fn(*args, **kwargs)\n File \"/usr/local/lib/python3.10/site-packages/huggingface_hub/_snapshot_download.py\", line 248, in snapshot_download\n raise LocalEntryNotFoundError(\nhuggingface_hub.errors.LocalEntryNotFoundError: An error happened while trying to locate the files on the Hub and we cannot find the appropriate snapshot folder for the specified revision on the local disk. Please check your internet connection and try again.\n\nDuring handling of the above exception, another exception occurred:\n\nTraceback (most recent call last):\n File \"/usr/local/lib/python3.10/site-packages/huggingface_hub/utils/_http.py\", line 409, in hf_raise_for_status\n response.raise_for_status()\n File \"/usr/local/lib/python3.10/site-packages/requests/models.py\", line 1026, in raise_for_status\n raise HTTPError(http_error_msg, response=self)\nrequests.exceptions.HTTPError: 504 Server Error: Gateway Timeout for url: https://huggingface.co/api/spaces/demo-leaderboard-backend/leaderboard/restart\n\nThe above exception was the direct cause of the following exception:\n\nTraceback (most recent call last):\n File \"/home/user/app/app.py\", line 42, in \n restart_space()\n File \"/home/user/app/app.py\", line 33, in restart_space\n API.restart_space(repo_id=REPO_ID)\n File \"/usr/local/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py\", line 114, in _inner_fn\n return fn(*args, **kwargs)\n File 
\"/usr/local/lib/python3.10/site-packages/huggingface_hub/hf_api.py\", line 7349, in restart_space\n hf_raise_for_status(r)\n File \"/usr/local/lib/python3.10/site-packages/huggingface_hub/utils/_http.py\", line 482, in hf_raise_for_status\n raise _format(HfHubHTTPError, str(e), response) from e\nhuggingface_hub.errors.HfHubHTTPError: 504 Server Error: Gateway Timeout for url: https://huggingface.co/api/spaces/demo-leaderboard-backend/leaderboard/restart (Request ID: Root=1-68dee009-03ce9c6d26c707a038e76e56;64245849-e8d1-4da7-9204-e14791d46b0e)\n\nThe request is taking longer than expected, please try again later.\n","replicas":{"requested":1},"devMode":false,"domains":[{"domain":"demo-leaderboard-backend-leaderboard.hf.space","stage":"READY"}]},"shortDescription":"Duplicate this leaderboard to initialize your own!","title":"Example Leaderboard Template","isLikedByUser":false,"ai_short_description":"View and submit LLM evaluations","ai_category":"Model Benchmarking","trendingScore":0,"tags":["gradio","leaderboard","region:us"],"featured":false},{"_id":"68e4f65c3e1d4efa12fd67c1","position":3,"type":"space","author":"OpenEvals","authorData":{"_id":"67bed722e00731308d6a506d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/6202a599216215a22221dea9/LHg8qch52zd2GAxCQvPoq.png","fullname":"OpenEvals","name":"OpenEvals","type":"org","isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":153,"isUserFollowing":false},"colorFrom":"green","colorTo":"blue","createdAt":"2025-10-07T09:54:51.000Z","emoji":"🐒","id":"OpenEvals/EvalsOnTheHub","lastModified":"2025-10-10T08:45:11.000Z","likes":2,"pinned":false,"private":false,"sdk":"gradio","repoType":"space","runtime":{"stage":"RUNNING","hardware":{"current":"cpu-upgrade","requested":"cpu-upgrade"},"storage":null,"gcTimeout":172800,"replicas":{"current":1,"requested":1},"devMode":false,"domains":[{"domain":"openevals-evalsonthehub.hf.space","stage":"READY"}],"sha":"c97be0a289f712b2dd698d28fefc2686f81050bc"},"
title":"Run your LLM evaluations on the hub","isLikedByUser":false,"ai_short_description":"Generate a command to run model evaluations","ai_category":"Text Generation","trendingScore":0,"tags":["gradio","region:us"],"featured":false}],"position":1,"theme":"purple","private":false,"shareUrl":"https://hf.co/collections/OpenEvals/making-evals-easy","upvotes":0,"isUpvotedByUser":false},{"slug":"OpenEvals/yourbench-67ed7bff229aab5b7c50ca24","title":"YourBench","description":"","gating":false,"lastUpdated":"2025-10-07T09:49:28.604Z","owner":{"_id":"67bed722e00731308d6a506d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/6202a599216215a22221dea9/LHg8qch52zd2GAxCQvPoq.png","fullname":"OpenEvals","name":"OpenEvals","type":"org","isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":153,"isUserFollowing":false},"items":[{"_id":"67ed7c1554dccfe33def0000","position":1,"type":"space","author":"yourbench","authorData":{"_id":"678905a9cd3f9fe60098e689","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/63f5010dfcf95ecac2ad8652/jda1zapqdQU6Og-YjH2ta.png","fullname":"Your Bench","name":"yourbench","type":"org","isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":71,"isUserFollowing":false},"colorFrom":"yellow","colorTo":"gray","createdAt":"2025-03-13T21:09:54.000Z","emoji":"πŸš€","id":"yourbench/advanced","lastModified":"2025-10-07T09:53:54.000Z","likes":44,"pinned":false,"private":false,"sdk":"docker","repoType":"space","runtime":{"stage":"RUNNING","hardware":{"current":"cpu-upgrade","requested":"cpu-upgrade"},"storage":"small","gcTimeout":172800,"replicas":{"current":1,"requested":1},"devMode":false,"domains":[{"domain":"yourbench-yourbench-space.hf.space","stage":"READY"},{"domain":"yourbench-yourbench-demo.hf.space","stage":"READY"},{"domain":"yourbench-advanced.hf.space","stage":"READY"}],"sha":"ed40d3a0a367926bbc111a8d86b18eec73605a48"},"shortDescription":"Generate custom evaluations from your data 
easily!","title":"YourBench","isLikedByUser":false,"ai_short_description":"Create dynamic benchmarks from documents","ai_category":"Document Analysis","trendingScore":0,"tags":["docker","region:us"],"featured":false},{"_id":"67ed7c2a6f1f9e4152d21714","position":2,"type":"dataset","author":"sumukshashidhar-archive","downloads":125,"gated":false,"id":"sumukshashidhar-archive/tempora","lastModified":"2025-06-02T23:55:35.000Z","datasetsServerInfo":{"viewer":"viewer","numRows":13217,"libraries":["datasets","dask","mlcroissant","polars"],"formats":["parquet"],"modalities":["text"]},"private":false,"repoType":"dataset","likes":3,"isLikedByUser":false,"isBenchmark":false}],"position":3,"theme":"indigo","private":false,"shareUrl":"https://hf.co/collections/OpenEvals/yourbench","upvotes":0,"isUpvotedByUser":false},{"slug":"OpenEvals/archived-open-llm-leaderboard-2024-2025-67c1796926298572a216ebf5","title":"Archived Open LLM Leaderboard (2024-2025)","description":"This leaderboard has been evaluating LLMs from Jun 2024 on IFEval, MuSR, GPQA, MATH, BBH and MMLU-Pro","gating":false,"lastUpdated":"2025-10-07T09:49:28.693Z","owner":{"_id":"67bed722e00731308d6a506d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/6202a599216215a22221dea9/LHg8qch52zd2GAxCQvPoq.png","fullname":"OpenEvals","name":"OpenEvals","type":"org","isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":153,"isUserFollowing":false},"items":[{"_id":"67c17a962665c6ccb149b2c5","position":0,"type":"space","note":{"html":"Blog on why we made a new version of the Open LLM Leaderboard\n","text":"Blog on why we made a new version of the Open LLM Leaderboard\n"},"author":"open-llm-leaderboard","authorData":{"_id":"649070e345920777b9f1f5c1","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/5df7e9e5da6d0311fd3d53f9/j21QZzv9_PGPUH5FbUaeM.png","fullname":"Open LLM 
Leaderboard","name":"open-llm-leaderboard","type":"org","isHf":false,"isHfAdmin":false,"isMod":false,"plan":"team","followerCount":1720,"isUserFollowing":false},"colorFrom":"pink","colorTo":"red","createdAt":"2024-06-23T16:59:22.000Z","emoji":"πŸ”οΈ","id":"open-llm-leaderboard/blog","lastModified":"2024-07-01T08:57:50.000Z","likes":125,"pinned":false,"private":false,"sdk":"static","repoType":"space","runtime":{"stage":"RUNNING","hardware":{"current":null,"requested":null},"storage":null,"replicas":{"requested":1,"current":1}},"title":"Open-LLM performances are plateauing, let’s make the leaderboard steep again","isLikedByUser":false,"ai_short_description":"Explore and compare advanced language models on a new leaderboard","ai_category":"Text Analysis","trendingScore":0,"tags":["static","region:us"],"featured":true},{"_id":"67c17a893844db9f344218ec","position":1,"type":"space","note":{"html":"The actual leaderboard! With a stylish new ux :)\n","text":"The actual leaderboard! With a stylish new ux :)\n"},"author":"open-llm-leaderboard","authorData":{"_id":"649070e345920777b9f1f5c1","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/5df7e9e5da6d0311fd3d53f9/j21QZzv9_PGPUH5FbUaeM.png","fullname":"Open LLM 
Leaderboard","name":"open-llm-leaderboard","type":"org","isHf":false,"isHfAdmin":false,"isMod":false,"plan":"team","followerCount":1720,"isUserFollowing":false},"colorFrom":"blue","colorTo":"red","createdAt":"2023-04-17T11:40:06.000Z","emoji":"πŸ†","id":"open-llm-leaderboard/open_llm_leaderboard","lastModified":"2025-03-25T09:02:15.000Z","likes":13856,"pinned":true,"private":false,"sdk":"docker","repoType":"space","runtime":{"stage":"RUNNING","hardware":{"current":"cpu-upgrade","requested":"cpu-upgrade"},"storage":"small","gcTimeout":172800,"replicas":{"current":1,"requested":1},"devMode":false,"domains":[{"domain":"open-llm-leaderboard-open-llm-leaderboard.hf.space","stage":"READY"}],"sha":"6ee9164f8a40124224ffd0ca2be9d859f048dacb"},"shortDescription":"Track, rank and evaluate open LLMs and chatbots","title":"Open LLM Leaderboard","isLikedByUser":false,"ai_short_description":"Compare open-source LLMs across multiple benchmarks","ai_category":"Data Visualization","trendingScore":12,"tags":["docker","leaderboard","modality:text","submission:automatic","test:public","language:english","eval:code","eval:math","region:us"],"featured":false},{"_id":"67c17aa02620d9e36b091c9d","position":2,"type":"dataset","note":{"html":"If you want to download the main leaderboard table, you'll find the dataset here!\n","text":"If you want to download the main leaderboard table, you'll find the dataset here!\n"},"author":"open-llm-leaderboard","downloads":9973,"gated":false,"id":"open-llm-leaderboard/contents","lastModified":"2025-03-20T12:17:27.000Z","datasetsServerInfo":{"viewer":"viewer","numRows":4576,"libraries":["datasets","pandas","mlcroissant","polars"],"formats":["parquet"],"modalities":["tabular","text"]},"private":false,"repoType":"dataset","likes":21,"isLikedByUser":false,"isBenchmark":false},{"_id":"67c17aac08250e6cd7650588","position":3,"type":"dataset","note":{"html":"To extract more detailed aggregated results for each model, look here!","text":"To extract more detailed 
aggregated results for each model, look here!"},"author":"open-llm-leaderboard","downloads":22644,"gated":false,"id":"open-llm-leaderboard/results","lastModified":"2025-03-15T05:57:14.000Z","datasetsServerInfo":{"viewer":"preview","numRows":0,"libraries":[],"formats":[],"modalities":[]},"private":false,"repoType":"dataset","likes":18,"isLikedByUser":false,"isBenchmark":false}],"position":4,"theme":"pink","private":false,"shareUrl":"https://hf.co/collections/OpenEvals/archived-open-llm-leaderboard-2024-2025","upvotes":0,"isUpvotedByUser":false},{"slug":"OpenEvals/archived-open-llm-leaderboard-2023-2024-67c177b19855155fbc5f1fa1","title":"Archived Open LLM Leaderboard (2023-2024)","description":"This leaderboard evaluated 7K LLMs from Apr 2023 to Jun 2024, on ARC-c, HellaSwag, MMLU, TruthfulQA, Winogrande and GSM8K","gating":false,"lastUpdated":"2025-10-07T09:49:28.693Z","owner":{"_id":"67bed722e00731308d6a506d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/6202a599216215a22221dea9/LHg8qch52zd2GAxCQvPoq.png","fullname":"OpenEvals","name":"OpenEvals","type":"org","isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":153,"isUserFollowing":false},"items":[{"_id":"67c17917b606ad8161744c9c","position":0,"type":"space","author":"open-llm-leaderboard-old","authorData":{"_id":"66608c3ecf3bb5532271a754","avatarUrl":"https://www.gravatar.com/avatar/4e74f54a5a0bd0efd064aabfd81f13f1?d=retro&size=100","fullname":"Open LLM Leaderboard 
Archive","name":"open-llm-leaderboard-old","type":"org","isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":60,"isUserFollowing":false},"colorFrom":"green","colorTo":"indigo","createdAt":"2024-06-25T15:29:15.000Z","emoji":"πŸ†","id":"open-llm-leaderboard-old/open_llm_leaderboard","lastModified":"2025-09-08T07:48:03.000Z","likes":104,"pinned":true,"private":false,"sdk":"gradio","repoType":"space","runtime":{"stage":"RUNNING","hardware":{"current":"cpu-upgrade","requested":"cpu-upgrade"},"storage":"small","gcTimeout":172800,"replicas":{"current":1,"requested":1},"devMode":false,"domains":[{"domain":"open-llm-leaderboard-old-open-llm-leaderboard.hf.space","stage":"READY"}],"sha":"a9e9888df143423aa669ffbb531e8c183f53bce9"},"shortDescription":"Track, rank and evaluate open LLMs and chatbots","title":"Open LLM Leaderboard","isLikedByUser":false,"ai_short_description":"Display and analyze LLM benchmark data","ai_category":"Model Benchmarking","trendingScore":0,"tags":["gradio","leaderboard","region:us"],"featured":false},{"_id":"67c1792abd28bbc01367464c","position":1,"type":"dataset","author":"open-llm-leaderboard-old","downloads":411626,"gated":false,"id":"open-llm-leaderboard-old/requests","lastModified":"2024-06-19T21:36:08.000Z","datasetsServerInfo":{"viewer":"viewer","numRows":3,"libraries":["datasets","dask","mlcroissant"],"formats":["json"],"modalities":["text"]},"private":false,"repoType":"dataset","likes":22,"isLikedByUser":false,"isBenchmark":false},{"_id":"67c179327dfda2d366e16b5f","position":2,"type":"dataset","author":"open-llm-leaderboard-old","downloads":7870,"gated":false,"id":"open-llm-leaderboard-old/results","lastModified":"2024-07-18T13:49:22.000Z","datasetsServerInfo":{"viewer":"preview","numRows":0,"libraries":[],"formats":[],"modalities":[]},"private":false,"repoType":"dataset","likes":50,"isLikedByUser":false,"isBenchmark":false}],"position":7,"theme":"green","private":false,"shareUrl":"https://hf.co/collections/OpenEvals/archived-open-llm
-leaderboard-2023-2024","upvotes":0,"isUpvotedByUser":false}],"datasets":[{"author":"OpenEvals","downloads":116,"gated":false,"id":"OpenEvals/IMO-AnswerBench","lastModified":"2026-01-23T16:26:25.000Z","datasetsServerInfo":{"viewer":"viewer","numRows":400,"libraries":["datasets","pandas","polars","mlcroissant"],"formats":["parquet","optimized-parquet"],"modalities":["text"]},"private":false,"repoType":"dataset","likes":0,"isLikedByUser":false,"isBenchmark":false},{"author":"OpenEvals","downloads":47,"gated":false,"id":"OpenEvals/MuSR","lastModified":"2025-12-12T12:58:31.000Z","datasetsServerInfo":{"viewer":"viewer","numRows":756,"libraries":["datasets","pandas","polars","mlcroissant"],"formats":["parquet","optimized-parquet"],"modalities":["text"]},"private":false,"repoType":"dataset","likes":0,"isLikedByUser":false,"isBenchmark":false},{"author":"OpenEvals","downloads":128,"gated":false,"id":"OpenEvals/aime_24","lastModified":"2025-12-12T12:56:49.000Z","datasetsServerInfo":{"viewer":"viewer","numRows":30,"libraries":["datasets","pandas","mlcroissant","polars"],"formats":["parquet"],"modalities":["text"]},"private":false,"repoType":"dataset","likes":1,"isLikedByUser":false,"isBenchmark":false},{"author":"OpenEvals","downloads":610,"gated":false,"id":"OpenEvals/SimpleQA","lastModified":"2025-12-12T12:56:01.000Z","datasetsServerInfo":{"viewer":"viewer","numRows":4326,"libraries":["datasets","pandas","mlcroissant","polars"],"formats":["parquet"],"modalities":["text"]},"private":false,"repoType":"dataset","likes":4,"isLikedByUser":false,"isBenchmark":false}],"models":[],"paperPreviews":[],"spaces":[{"author":"OpenEvals","authorData":{"_id":"67bed722e00731308d6a506d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/6202a599216215a22221dea9/LHg8qch52zd2GAxCQvPoq.png","fullname":"OpenEvals","name":"OpenEvals","type":"org","isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":153,"isUserFollowing":false},"colorFrom":"blue","colorTo":"red","created
At":"2025-10-16T09:08:47.000Z","emoji":"πŸ“š","id":"OpenEvals/open_benchmark_index","lastModified":"2026-01-14T14:39:12.000Z","likes":28,"pinned":true,"private":false,"sdk":"gradio","repoType":"space","runtime":{"stage":"RUNNING","hardware":{"current":"cpu-basic","requested":"cpu-basic"},"storage":null,"gcTimeout":172800,"replicas":{"current":1,"requested":1},"devMode":false,"domains":[{"domain":"saylortwift-benchmark-finder.hf.space","stage":"READY"},{"domain":"saylortwift-open-benchmark-index.hf.space","stage":"READY"},{"domain":"openevals-open-benchmark-index.hf.space","stage":"READY"}],"sha":"97ae1bac1c1ea5c940864953ff931d0a8970b42a"},"shortDescription":"A space to view and inspect all the tasks in lighteval","title":"Benchmark Finder","isLikedByUser":false,"ai_short_description":"Explore and search through Lighteval benchmark tasks","ai_category":"Text Analysis","trendingScore":0,"tags":["gradio","region:us"],"featured":false},{"author":"OpenEvals","authorData":{"_id":"67bed722e00731308d6a506d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/6202a599216215a22221dea9/LHg8qch52zd2GAxCQvPoq.png","fullname":"OpenEvals","name":"OpenEvals","type":"org","isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":153,"isUserFollowing":false},"colorFrom":"blue","colorTo":"indigo","createdAt":"2025-11-24T13:18:24.000Z","emoji":"πŸ“","id":"OpenEvals/evaluation-guidebook","lastModified":"2025-12-04T09:45:48.000Z","likes":272,"pinned":true,"private":false,"sdk":"docker","repoType":"space","runtime":{"stage":"RUNNING","hardware":{"current":"cpu-basic","requested":"cpu-basic"},"storage":null,"gcTimeout":172800,"replicas":{"current":1,"requested":1},"devMode":false,"domains":[{"domain":"clefourrier-evaluation-guidebook.hf.space","stage":"READY"},{"domain":"openevals-evaluation-guidebook.hf.space","stage":"READY"}],"sha":"c7ddebad779a8bcef006f9e0ab2caa1f896df346"},"title":"Evaluation 
Guidebook","isLikedByUser":false,"originRepo":{"name":"tfrere/research-article-template","author":{"_id":"64cb7fdb9e30a46f7b92aa45","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/64cb7fdb9e30a46f7b92aa45/TKaRtn_-R_W__QY8DsQv3.jpeg","fullname":"frere thibaud","name":"tfrere","type":"user","isPro":true,"isHf":true,"isHfAdmin":false,"isMod":false,"followerCount":105,"isUserFollowing":false}},"ai_short_description":"Explore LLM benchmark trends over time","ai_category":"Data Visualization","trendingScore":4,"tags":["docker","research-article-template","research paper","scientific paper","data visualization","region:us"],"featured":false},{"author":"OpenEvals","authorData":{"_id":"67bed722e00731308d6a506d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/6202a599216215a22221dea9/LHg8qch52zd2GAxCQvPoq.png","fullname":"OpenEvals","name":"OpenEvals","type":"org","isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":153,"isUserFollowing":false},"colorFrom":"indigo","colorTo":"purple","createdAt":"2025-01-16T10:37:58.000Z","emoji":"πŸ”","id":"OpenEvals/find-a-leaderboard","lastModified":"2025-03-04T08:29:15.000Z","likes":135,"pinned":true,"private":false,"sdk":"docker","repoType":"space","runtime":{"stage":"RUNNING","hardware":{"current":"cpu-basic","requested":"cpu-basic"},"storage":null,"gcTimeout":172800,"replicas":{"current":1,"requested":1},"devMode":false,"domains":[{"domain":"leaderboard-explorer-leaderboard-explorer.hf.space","stage":"READY"},{"domain":"openevals-find-a-leaderboard.hf.space","stage":"READY"}],"sha":"0dfbcb9e1a01dd0bf3995e8e02ac7d9335faf2bf"},"shortDescription":"Explore and discover all leaderboards from the HF community","title":"Find a leaderboard","isLikedByUser":false,"ai_short_description":"Display leaderboards with dark mode support","ai_category":"Data 

    AI & ML interests

    LLM evaluation


    Hi! Welcome to the org page of the Evaluation team at Hugging Face. We want to support the community in building and sharing quality evaluations, for reproducible and fair model comparisons, to cut through release hype and better understand actual model capabilities.

    We're behind the:

    • evaluation guidebook, your reference for LLM evals
    • lighteval LLM evaluation suite, fast and filled with the SOTA benchmarks you might want
    • leaderboards on the hub initiative, to encourage people to build more leaderboards in the open for more reproducible evaluation. You'll find docs here to build your own, and you can look for the best leaderboard for your use case here!

    Our archived projects:

    We're not behind the evaluate metrics guide but if you want to understand metrics better we really recommend checking it out!
