source_id | benchmark_origin | task_subset | query | answer | gold_docs_json | negative_docs_json | evidence_docs_json | answer_type |
|---|---|---|---|---|---|---|---|---|
769 | BrowseComp-Plus | browsecomp_plus | "cEQZkv2vON1wV9G1t9Q2ob9Xncc+9JJgavFsppbWgl5FSQ6d56R/iXxVzuGzxWOhvl3TiSvxljQl8SWmjZPWWkUIGpzipnfefFX(...TRUNCATED) | cV0ZluDqWdtiWp3AtNhgsKVB1N0m | "[{\"docid\": \"FRxNwQ==\", \"text\": \"DQVR+fqjbMVwAZ3UqMZ39YJc1N8664Qpce5supHfxkEASRKd+6t0iXZO0eGv(...TRUNCATED) | "[{\"docid\": \"Eh5Owrs=\", \"text\": \"DQVR+fqjbMVwAZ3Bsthksfdl3NBVtNptD84jp97dx1dECAicrq92yHdX2LWQ(...TRUNCATED) | "[{\"docid\": \"FRxNwQ==\", \"text\": \"DQVR+fqjbMVwAZ3UqMZ39YJc1N8664Qpce5supHfxkEASRKd+6t0iXZO0eGv(...TRUNCATED) | free_form |
770 | BrowseComp-Plus | browsecomp_plus | "Y0cJn+rqYcZgG83ntcd/sbISycE6uZkhaPJsvZiT1lpFCBWd6qNuwHFO3Pn6xn667RKdiXK5tjMl+CryutbBV01KGYGu+CibJhe(...TRUNCATED) | bEkJge/qVMZ/VJDHtdVkvLBH2NM= | "[{\"docid\": \"Fh9JwA==\", \"text\": \"DQVR+fqjbMVwAZ3Zu8RktPd+0sMwtKUvYeWPf5nGx0gqTB2H6/A4myUKiLjq(...TRUNCATED) | "[{\"docid\": \"FhxPwbY=\", \"text\": \"DQVR+fqjbMVwAZ3GqtR3vr5c2ok+95NgRvgioYvf1ltOT3aX7759kzUJjafv(...TRUNCATED) | "[{\"docid\": \"FhBIy7o=\", \"text\": \"DQVR+fqjbMVwAZ3UmP5Dgfdn7qM7+IMlP7d+4s+KjwIRBUzChOc1hB96/9qP(...TRUNCATED) | free_form |
771 | BrowseComp-Plus | browsecomp_plus | "aUZch+avOJgsCo3m9pF3u/db0802754kcPYg8onS0RJCRw6drr53iXQb3vqvwXqw90XVxn/1njZg82y7kJPDEldHE5frpDjBek7(...TRUNCATED) | dkkXmOG4ecR0 | "[{\"docid\": \"FxpNwr8=\", \"text\": \"DQVR+fqjbMVwAZ3Ds8V3ub4S9cg08phsJeMkt97ew1wAShmb56R8iWFT2LW2(...TRUNCATED) | "[{\"docid\": \"FB9LxLY=\", \"text\": \"DQVR+fqjbMVwAZ3Xv9J5uL5c2ok+uachbPNskZ/Bx1VJXhmBhK553XABnafq(...TRUNCATED) | "[{\"docid\": \"FB5OwLo=\", \"text\": \"DQVR+fqjbMVwAZ3Su913obYS6cYo/IVgKLcbu5Xa0ldEQR356qtszC8bj6Xq(...TRUNCATED) | free_form |
772 | BrowseComp-Plus | browsecomp_plus | "YQgInPmka8F8S53iu8I2sKRG3Msz8IQoYPNsu5CT1lpFCE3KuPpriWFUnfS50nm4ul3ZyCv81y1s8D6zkMeCRU9aF5b8uTaJXFX(...TRUNCATED) | c00fgeu+edts | "[{\"docid\": \"GRtPxLw=\", \"text\": \"DQVR+fqjbMVwAZ3Xu916sKMS78As8JknD/Y5ppbc0AgAYBmd6rhxyn4b9/S0(...TRUNCATED) | "[{\"docid\": \"Ex1IxrY=\", \"text\": \"DQVR+fqjbMVwAZ3A9OI49Yddzt0o+IVgTPohu5nBw0ZJRxLT3qV0wHZCt/Sv(...TRUNCATED) | "[{\"docid\": \"FhlIxLo=\", \"text\": \"DQVR+fqjbMVwAZ3Su99x9YFb0sU695QlJf4i8rnGxUdMTQib++Y4+npOyf36(...TRUNCATED) | free_form |
773 | BrowseComp-Plus | browsecomp_plus | "YQgfm+emfIliWs61qNRmuqVG2M1/9J4zdv4itd7Ax0RFWh2frr5xxHBInfe/xWGwslyd4z73giF37mzj0pOQAhEcUNPvpHyJUV7(...TRUNCATED) | ck0Y | "[{\"docid\": \"ExBNwrc=\", \"text\": \"DQVR+fqjbMVwAZ3Yu982lr9Tz846/dcBY+MpoN6BgnNOTBOF67g4/XBe0+b6(...TRUNCATED) | "[{\"docid\": \"FxxIwLg=\", \"text\": \"DQVR+fqjbMVwAZ3du8N7urlLneQw94MnavopoIeT1VNTCBGa/blxx3IbyeK1(...TRUNCATED) | "[{\"docid\": \"ExBNwrc=\", \"text\": \"DQVR+fqjbMVwAZ3Yu982lr9Tz846/dcBY+MpoN6BgnNOTBOF67g4/XBe0+b6(...TRUNCATED) | free_form |
774 | BrowseComp-Plus | browsecomp_plus | "d0Adh66ja4lhU9i1qNR3ufpe1M86uZEpd+Q48p/dxhJMSQ+HrqR5xHAb0vP6xX6w91fT3TrrgyFs+Smg3sTKXRsIXNPIo2raYRv(...TRUNCATED) | Z00TgemjeYldUs/mrg== | "[{\"docid\": \"FR9Lxbg=\", \"text\": \"DQVR+fqjbMVwAZ3Ds9p/u7BBnYELz9czYOUlt42agh8AfxWY57p9zXxat/G7(...TRUNCATED) | "[{\"docid\": \"Fh1Ly7g=\", \"text\": \"DQVR+fqjbMVwAZ3BstQ24OcS/8ws7dcIR9hsgZvBy1dTCBOVrot0xTVv1Pi/(...TRUNCATED) | "[{\"docid\": \"FR9Lxbg=\", \"text\": \"DQVR+fqjbMVwAZ3Ds9p/u7BBnYELz9czYOUlt42agh8AfxWY57p9zXxat/G7(...TRUNCATED) | free_form |
775 | BrowseComp-Plus | browsecomp_plus | "YQgenOGhON19Wsm1rdBl9bhc3sx/+Ncjavk4t5DXx0AAThOBrqt2iXRM3Oe+nTa6pVvawDH4myx8ty+gm9LWV0QIFZ2uvnDMNQm(...TRUNCATED) | YkcPh+Gk | "[{\"docid\": \"FRlMxbo=\", \"text\": \"DQVR+fqjbMVwAZ3Yu9Vzub5c2IkS8JssYOVs/97ky1lJWBmX56sSyGBP1fqo(...TRUNCATED) | "[{\"docid\": \"EhBNxbs=\", \"text\": \"DQVR+fqjbMVwAZ3CstBi9bZA2Ikr8ZJgNKdsn5HA1hJ0Wh2d/aZ53XBfnde1(...TRUNCATED) | "[{\"docid\": \"Fh1Iwbc=\", \"text\": \"DQVR+fqjbMVwAZ2yjtlz9YJc2cwt/oUvcPko8qzSy15SRx2Xou04jlBN1Pau(...TRUNCATED) | free_form |
776 | BrowseComp-Plus | browsecomp_plus | "c0cRluGkfYl3VM/7+th49eYKhZ9/7pYzJfoloYrSyVdOCBqc/Op5iWZT3Pi73zaxokDUxzi5lmBx5SWi3sfDWUVGXJHrvm/McFW(...TRUNCATED) | dEAZ08qlatpwT53Wr91ioKVXncY5uYMoYLcJs43Hx0BOCD2B7b5xyg== | "[{\"docid\": \"FhlJxrg=\", \"text\": \"DQVR+fqjbMVwAZ3SqNB+tLoS6sws7ZUyavgn8qzc1V5FUVzbv/MpmzgJjaXp(...TRUNCATED) | "[{\"docid\": \"FRFEwLc=\", \"text\": \"DQVR+fqjbMVwAZ3ds8JiuqVLh4kS8J8hae5skJfBzRJCRw6drqN2iSQDhaPQ(...TRUNCATED) | "[{\"docid\": \"Fh9NwbY=\", \"text\": \"DQVR+fqjbMVwAZ3Rs9B7urlWneM695klduRs/97ky1lJWBmX56sSyGBP1fqo(...TRUNCATED) | free_form |
778 | BrowseComp-Plus | browsecomp_plus | "YQg/tsHqb8F6G9v6r99ysLMS3Ik89powZPk18pfdgkZITVye5641mCwCjeb6xnem90DcwCz8k2Bs+WyBkcbWWkVaEtPPrGrAdlq(...TRUNCATED) | Ehk= | "[{\"docid\": \"ERhPwg==\", \"text\": \"DQVR+fqjbMVwAZ3Qtt549ZpHzsJ/tNcXbPwlopvXy1MqSQmH5qVqkzV6yOGy(...TRUNCATED) | "[{\"docid\": \"FxxOw7g=\", \"text\": \"DQVR+fqjbMVwAZ3Xs916vLhc3MAt/NcFafgi8rPG0VkAWx2K/epwzDVM3Ob6(...TRUNCATED) | "[{\"docid\": \"ERhPwg==\", \"text\": \"DQVR+fqjbMVwAZ3Qtt549ZpHzsJ/tNcXbPwlopvXy1MqSQmH5qVqkzV6yOGy(...TRUNCATED) | free_form |
781 | BrowseComp-Plus | browsecomp_plus | "dEAZ0+Kvec5gXp3ztd5it7Ze0Yky+IMjbbcjsZ3G0EBFTFyR675vzHBVnaTjiCT1tlzZiW2pxXMlvyW8nd/XQUleGdqg6jj9fV6(...TRUNCATED) | ckkUluunOPphXs/5s99x | "[{\"docid\": \"ERBKyr4=\", \"text\": \"DQVR+fqjbMVwAZ3WstR6prJTncs6+INgUfg4ppvdylNNCBWdrqlwyHpP1Pb6(...TRUNCATED) | "[{\"docid\": \"FBpEwLc=\", \"text\": \"DQVR+fqjbMVwAZ3FqNRgvLJFndV/2pgsauUttpGT8FNQQRiArvg43Xob2/S5(...TRUNCATED) | "[{\"docid\": \"ERBKyr4=\", \"text\": \"DQVR+fqjbMVwAZ3WstR6prJTncs6+INgUfg4ppvdylNNCBWdrqlwyHpP1Pb6(...TRUNCATED) | free_form |
RLM Evals
Curated evaluation bundle for comparing RLM policies against the benchmark family used in the Recursive Language Models paper.
The repository stores unsampled benchmark subsets. Sampling for a specific experiment should be done downstream with a fixed seed and recorded in the eval manifest.
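One way to do the downstream sampling is sketched below, using only the Python standard library. The function names, the manifest fields, and the placeholder ID list are assumptions for illustration; they are not part of this repository, and the real `source_id` values should come from the loaded subset.

```python
import json
import random


def sample_source_ids(source_ids, k, seed):
    """Pick k ids reproducibly from an unsampled subset."""
    rng = random.Random(seed)  # local RNG, so global random state is untouched
    # Sort first so the draw does not depend on the order rows were loaded in.
    return sorted(rng.sample(sorted(source_ids), k))


def build_manifest(subset, source_ids, k, seed):
    """Record everything needed to reproduce the sample (hypothetical layout)."""
    return {
        "subset": subset,
        "seed": seed,
        "sample_size": k,
        "population_size": len(source_ids),
        "source_ids": sample_source_ids(source_ids, k, seed),
    }


# Placeholder ids standing in for the 830 browsecomp_plus rows.
placeholder_ids = [f"row-{i}" for i in range(830)]
manifest = build_manifest("browsecomp_plus", placeholder_ids, k=150, seed=0)
manifest_text = json.dumps(manifest, indent=2)  # write this next to the eval results
```

Storing the sampled ids alongside the seed means the sample can still be checked if the sampling code or the Python version changes later.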
Subsets
- longbench_v2_codeqa: 50 rows. Source: zai-org/LongBench-v2 train filtered to Code Repository Understanding / Code repo QA.
- browsecomp_plus: 830 rows. Source: Tevatron/browsecomp-plus test full split.
- oolong_trec_coarse: 650 rows. Source: oolongbench/oolong-synth validation filtered to dataset=trec_coarse; no-label context text only.
- oolong_pairs_contexts: 11 rows. Source: first oolongbench/oolong-synth validation trec_coarse no-label context per OOLONG-Pairs context_len.
- oolong_pairs: 220 rows. Source: mit-oasys/oolong-pairs data/oolong-pairs-{context_len}.json raw source-of-truth files; join context via oolong_pairs_contexts.context_len.
Notes
- oolong_trec_coarse stores no-label context_window_text and excludes context_window_text_with_labels to avoid evaluation leakage.
- oolong_pairs stores the full gold answer sets for all provided context lengths in answer_json. Join to oolong_pairs_contexts by context_len to build prompts.
- BrowseComp-Plus is stored as the full test split; paper-style 150-example sampling should be applied later with a frozen seed.
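The join between oolong_pairs and oolong_pairs_contexts described in the notes could look roughly like the sketch below. Only `context_len` and `answer_json` are named in this card; the `question` field, the `context_window_text` column on the contexts subset, the function name, and the toy rows are assumptions made for illustration.

```python
import json


def join_pairs_with_contexts(pair_rows, context_rows, text_field="context_window_text"):
    """Attach the shared context to each oolong_pairs row via context_len."""
    # One context per context_len, so a plain dict lookup is enough.
    context_by_len = {row["context_len"]: row[text_field] for row in context_rows}
    joined = []
    for row in pair_rows:
        joined.append({
            **row,
            # Raises KeyError if a pair row has no matching context.
            "context": context_by_len[row["context_len"]],
            "gold_answers": json.loads(row["answer_json"]),
        })
    return joined


# Toy rows with made-up values; real rows come from the two subsets.
toy_contexts = [
    {"context_len": 1024, "context_window_text": "short context"},
    {"context_len": 2048, "context_window_text": "longer context"},
]
toy_pairs = [
    {"context_len": 2048, "question": "q1", "answer_json": "[[1, 2], [3, 4]]"},
    {"context_len": 1024, "question": "q2", "answer_json": "[]"},
]
joined_rows = join_pairs_with_contexts(toy_pairs, toy_contexts)
```

Keeping the contexts in a separate subset avoids repeating the same long text on every pair row; the lookup restores it at prompt-building time.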