Paper page - Retrieval-Infused Reasoning Sandbox: A Benchmark for Decoupling Retrieval and Reasoning Capabilities

https://github.com/Retrieval-Infused-Reasoning-Sandbox/Retrieval-Infused-Reasoning-Sandbox
homepage: https://retrieval-infused-reasoning-sandbox.github.io/

\n","updatedAt":"2026-02-06T03:54:42.397Z","author":{"_id":"6377366acc034ef804cf0aef","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/6377366acc034ef804cf0aef/JHsET8LJE2w1E7vJD-R2o.png","fullname":"ShawnYing","name":"ShawnYing","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":3,"isUserFollowing":false}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.44112762808799744},"editors":["ShawnYing"],"editorAvatarUrls":["https://cdn-avatars.huggingface.co/v1/production/uploads/6377366acc034ef804cf0aef/JHsET8LJE2w1E7vJD-R2o.png"],"reactions":[],"isReport":false,"parentCommentId":"698565cf48999b11958cecfa"}}]},{"id":"6986293d29238d0c819eb9a2","author":{"_id":"65243980050781c16f234f1f","avatarUrl":"/avatars/743a009681d5d554c27e04300db9f267.svg","fullname":"Avi","name":"avahal","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":3,"isUserFollowing":false},"createdAt":"2026-02-06T17:47:41.000Z","type":"comment","data":{"edited":false,"hidden":false,"latest":{"raw":"arXivLens breakdown of this paper ๐Ÿ‘‰ https://arxivlens.com/PaperView/Details/retrieval-infused-reasoning-sandbox-a-benchmark-for-decoupling-retrieval-and-reasoning-capabilities-6192-add5f3f6\n- Executive Summary\n- Detailed Breakdown\n- Practical Applications","html":"

arXivLens breakdown of this paper ๐Ÿ‘‰ https://arxivlens.com/PaperView/Details/retrieval-infused-reasoning-sandbox-a-benchmark-for-decoupling-retrieval-and-reasoning-capabilities-6192-add5f3f6

\n
    \n
  • Executive Summary
  • \n
  • Detailed Breakdown
  • \n
  • Practical Applications
  • \n
\n","updatedAt":"2026-02-06T17:47:41.791Z","author":{"_id":"65243980050781c16f234f1f","avatarUrl":"/avatars/743a009681d5d554c27e04300db9f267.svg","fullname":"Avi","name":"avahal","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":3,"isUserFollowing":false}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.6021074652671814},"editors":["avahal"],"editorAvatarUrls":["/avatars/743a009681d5d554c27e04300db9f267.svg"],"reactions":[],"isReport":false}},{"id":"6986589e57ce16a729e8b3be","author":{"_id":"65243980050781c16f234f1f","avatarUrl":"/avatars/743a009681d5d554c27e04300db9f267.svg","fullname":"Avi","name":"avahal","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":3,"isUserFollowing":false},"createdAt":"2026-02-06T21:09:50.000Z","type":"comment","data":{"edited":false,"hidden":false,"latest":{"raw":"arXivLens breakdown of this paper ๐Ÿ‘‰ https://arxivlens.com/PaperView/Details/retrieval-infused-reasoning-sandbox-a-benchmark-for-decoupling-retrieval-and-reasoning-capabilities-6192-add5f3f6\n- Executive Summary\n- Detailed Breakdown\n- Practical Applications","html":"

arXivLens breakdown of this paper ๐Ÿ‘‰ https://arxivlens.com/PaperView/Details/retrieval-infused-reasoning-sandbox-a-benchmark-for-decoupling-retrieval-and-reasoning-capabilities-6192-add5f3f6

\n
    \n
  • Executive Summary
  • \n
  • Detailed Breakdown
  • \n
  • Practical Applications
  • \n
\n","updatedAt":"2026-02-06T21:09:50.858Z","author":{"_id":"65243980050781c16f234f1f","avatarUrl":"/avatars/743a009681d5d554c27e04300db9f267.svg","fullname":"Avi","name":"avahal","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":3,"isUserFollowing":false}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.6021074652671814},"editors":["avahal"],"editorAvatarUrls":["/avatars/743a009681d5d554c27e04300db9f267.svg"],"reactions":[],"isReport":false}},{"id":"698697e4661b7c0fa193464b","author":{"_id":"63d3e0e8ff1384ce6c5dd17d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg","fullname":"Librarian Bot (Bot)","name":"librarian-bot","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":318,"isUserFollowing":false},"createdAt":"2026-02-07T01:39:48.000Z","type":"comment","data":{"edited":false,"hidden":false,"latest":{"raw":"This is an automated message from the [Librarian Bot](https://huggingface.co/librarian-bots). I found the following papers similar to this paper. 
This is an automated message from the Librarian Bot (https://huggingface.co/librarian-bots). I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API:

  • When Iterative RAG Beats Ideal Evidence: A Diagnostic Study in Scientific Multi-hop Question Answering (https://huggingface.co/papers/2601.19827) (2026)
  • DeepSynth-Eval: Objectively Evaluating Information Consolidation in Deep Survey Writing (https://huggingface.co/papers/2601.03540) (2026)
  • SIN-Bench: Tracing Native Evidence Chains in Long-Context Multimodal Scientific Interleaved Literature (https://huggingface.co/papers/2601.10108) (2026)
  • SciIF: Benchmarking Scientific Instruction Following Towards Rigorous Scientific Intelligence (https://huggingface.co/papers/2601.04770) (2026)
  • RPC-Bench: A Fine-grained Benchmark for Research Paper Comprehension (https://huggingface.co/papers/2601.14289) (2026)
  • Beyond the Needle's Illusion: Decoupled Evaluation of Evidence Access and Use under Semantic Interference at 326M-Token Scale (https://huggingface.co/papers/2601.20276) (2026)
  • DeepEra: A Deep Evidence Reranking Agent for Scientific Retrieval-Augmented Generated Question Answering (https://huggingface.co/papers/2601.16478) (2026)

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any paper on Hugging Face, check out this Space: https://huggingface.co/spaces/librarian-bots/recommend_similar_papers

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend
Gui","status":"claimed_verified","statusLastChangedAt":"2026-02-06T18:51:40.436Z","hidden":false},{"_id":"6985651d4ad556f294b7ebe5","name":"Zhongyuan Peng","hidden":false},{"_id":"6985651d4ad556f294b7ebe6","name":"Xin Li","hidden":false},{"_id":"6985651d4ad556f294b7ebe7","name":"Xeron Du","hidden":false},{"_id":"6985651d4ad556f294b7ebe8","name":"Libo Qin","hidden":false},{"_id":"6985651d4ad556f294b7ebe9","name":"YiXin Cao","hidden":false},{"_id":"6985651d4ad556f294b7ebea","user":{"_id":"638efcf4c67af472d316d424","avatarUrl":"/avatars/97a57859d7d87a3a8f1bb41d32a72bc2.svg","isPro":false,"fullname":"Ge Zhang","user":"zhangysk","type":"user"},"name":"Ge Zhang","status":"claimed_verified","statusLastChangedAt":"2026-02-06T18:51:37.889Z","hidden":false},{"_id":"6985651d4ad556f294b7ebeb","name":"Stephen Huang","hidden":false}],"publishedAt":"2026-01-29T16:26:19.000Z","submittedOnDailyAt":"2026-02-06T01:23:51.106Z","title":"Retrieval-Infused Reasoning Sandbox: A Benchmark for Decoupling Retrieval and Reasoning Capabilities","submittedOnDailyBy":{"_id":"6377366acc034ef804cf0aef","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/6377366acc034ef804cf0aef/JHsET8LJE2w1E7vJD-R2o.png","isPro":false,"fullname":"ShawnYing","user":"ShawnYing","type":"user"},"summary":"Despite strong performance on existing benchmarks, it remains unclear whether large language models can reason over genuinely novel scientific information. Most evaluations score end-to-end RAG pipelines, where reasoning is confounded with retrieval and toolchain choices, and the signal is further contaminated by parametric memorization and open-web volatility. We introduce DeR2, a controlled deep-research sandbox that isolates document-grounded reasoning while preserving core difficulties of deep search: multi-step synthesis, denoising, and evidence-based conclusion making. 
DeR2 decouples evidence access from reasoning via four regimes--Instruction-only, Concepts (gold concepts without documents), Related-only (only relevant documents), and Full-set (relevant documents plus topically related distractors)--yielding interpretable regime gaps that operationalize retrieval loss vs. reasoning loss and enable fine-grained error attribution. To prevent parametric leakage, we apply a two-phase validation that requires parametric failure without evidence while ensuring oracle-concept solvability. To ensure reproducibility, each instance provides a frozen document library (drawn from 2023-2025 theoretical papers) with expert-annotated concepts and validated rationales. Experiments across a diverse set of state-of-the-art foundation models reveal substantial variation and significant headroom: some models exhibit mode-switch fragility, performing worse with the Full-set than with Instruction-only, while others show structural concept misuse, correctly naming concepts but failing to execute them as procedures.","upvotes":19,"discussionId":"6985651d4ad556f294b7ebec","projectPage":"https://retrieval-infused-reasoning-sandbox.github.io/","githubRepo":"https://github.com/Retrieval-Infused-Reasoning-Sandbox/Retrieval-Infused-Reasoning-Sandbox","githubRepoAddedBy":"user","ai_summary":"DeR2 presents a controlled evaluation framework for assessing language models' document-grounded reasoning capabilities by isolating reasoning from retrieval and toolchain decisions.","ai_keywords":["large language models","document-grounded reasoning","retrieval-augmented generation","deep search","multi-step synthesis","denoising","evidence-based conclusion making","parametric memorization","open-web volatility","controlled deep-research sandbox","retrieval loss","reasoning loss","error attribution","two-phase validation","parametric failure","oracle-concept solvability","frozen document library","expert-annotated concepts","validated 
rationales"],"githubStars":2,"organization":{"_id":"67d1140985ea0644e2f14b99","name":"ByteDance-Seed","fullname":"ByteDance Seed","avatar":"https://cdn-uploads.huggingface.co/production/uploads/6535c9e88bde2fae19b6fb25/flkDUqd_YEuFsjeNET3r-.png"}},"canReadDatabase":false,"canManagePapers":false,"canSubmit":false,"hasHfLevelAccess":false,"upvoted":false,"upvoters":[{"_id":"6377366acc034ef804cf0aef","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/6377366acc034ef804cf0aef/JHsET8LJE2w1E7vJD-R2o.png","isPro":false,"fullname":"ShawnYing","user":"ShawnYing","type":"user"},{"_id":"655ddff3be545cd245744c38","avatarUrl":"/avatars/e6ac794e4298d0390ec9dc3a69c03aa1.svg","isPro":false,"fullname":"Jin Chen","user":"VanZieks","type":"user"},{"_id":"64a948d9723beceb2f13f5eb","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/64a948d9723beceb2f13f5eb/UQMO7LczJpabtbcoNt1zQ.jpeg","isPro":false,"fullname":"XinLi","user":"XINLI1997","type":"user"},{"_id":"67dbf07f9d821d38905d145d","avatarUrl":"/avatars/e806467d2c0f4f642c8d4906b0855817.svg","isPro":false,"fullname":"guixin","user":"Ross12","type":"user"},{"_id":"6849210e9e8f95397d320e15","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/no-auth/xTM8z6IlPzjmTWwKgu1u2.png","isPro":false,"fullname":"yaoyifan","user":"yyf12","type":"user"},{"_id":"638efcf4c67af472d316d424","avatarUrl":"/avatars/97a57859d7d87a3a8f1bb41d32a72bc2.svg","isPro":false,"fullname":"Ge Zhang","user":"zhangysk","type":"user"},{"_id":"63369da91ba5d5ece24118a4","avatarUrl":"/avatars/67889e1ecadb04100a77bc8b5284c6fd.svg","isPro":false,"fullname":"wuyuhao","user":"mozhu","type":"user"},{"_id":"658c32f76f797d950b810250","avatarUrl":"/avatars/a0c1bd216057f139be244cd86e5b2e80.svg","isPro":false,"fullname":"zheng 
weihua","user":"ZWHTXY","type":"user"},{"_id":"69031a790716493a711c528d","avatarUrl":"/avatars/75453e76ccad2accaa17eb8c9e140ecc.svg","isPro":false,"fullname":"ๅˆ˜ๆ€ๆ€ก","user":"Kemmy0616","type":"user"},{"_id":"69859c0e8f914275533f8eb9","avatarUrl":"/avatars/b65d19d267192fc79451f3475df15eb0.svg","isPro":false,"fullname":"pyz","user":"whoisbean","type":"user"},{"_id":"67337dffe77108f3cce35005","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/67337dffe77108f3cce35005/2mlcK-k7G6UgijO2J3QoP.jpeg","isPro":false,"fullname":"yunwenLi","user":"JunoLi622","type":"user"},{"_id":"65d2251f98b4a470bf6a26e3","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/65d2251f98b4a470bf6a26e3/C4T0LHYGejrI9mu_k3M8p.jpeg","isPro":false,"fullname":"xts","user":"xtsssss","type":"user"}],"acceptLanguages":["*"],"dailyPaperRank":0,"organization":{"_id":"67d1140985ea0644e2f14b99","name":"ByteDance-Seed","fullname":"ByteDance Seed","avatar":"https://cdn-uploads.huggingface.co/production/uploads/6535c9e88bde2fae19b6fb25/flkDUqd_YEuFsjeNET3r-.png"}}">
arxiv:2601.21937

Retrieval-Infused Reasoning Sandbox: A Benchmark for Decoupling Retrieval and Reasoning Capabilities

Published on Jan 29 · Submitted by ShawnYing on Feb 6
Authors: Shuangshuang Ying, Zheyu Wang, Yunjian Peng, Jin Chen, Yuhao Wu, Hongbin Lin, Dingyu He, Siyi Liu, Gengchen Yu, YinZhu Piao, Yuchen Wu, Xin Gui, Zhongyuan Peng, Xin Li, Xeron Du, Libo Qin, YiXin Cao, Ge Zhang, Stephen Huang
Abstract

DeR2 presents a controlled evaluation framework for assessing language models' document-grounded reasoning capabilities by isolating reasoning from retrieval and toolchain decisions.

AI-generated summary

Despite strong performance on existing benchmarks, it remains unclear whether large language models can reason over genuinely novel scientific information. Most evaluations score end-to-end RAG pipelines, where reasoning is confounded with retrieval and toolchain choices, and the signal is further contaminated by parametric memorization and open-web volatility. We introduce DeR2, a controlled deep-research sandbox that isolates document-grounded reasoning while preserving the core difficulties of deep search: multi-step synthesis, denoising, and evidence-based conclusion making. DeR2 decouples evidence access from reasoning via four regimes: Instruction-only, Concepts (gold concepts without documents), Related-only (only relevant documents), and Full-set (relevant documents plus topically related distractors). The gaps between regimes are interpretable: they operationalize retrieval loss versus reasoning loss and enable fine-grained error attribution. To prevent parametric leakage, we apply a two-phase validation that requires parametric failure without evidence while ensuring oracle-concept solvability. To ensure reproducibility, each instance provides a frozen document library (drawn from 2023-2025 theoretical papers) with expert-annotated concepts and validated rationales. Experiments across a diverse set of state-of-the-art foundation models reveal substantial variation and significant headroom: some models exhibit mode-switch fragility, performing worse with the Full-set than with Instruction-only, while others show structural concept misuse, correctly naming concepts but failing to execute them as procedures.
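The four-regime design lends itself to simple difference metrics. A minimal sketch of how such regime gaps and the two-phase instance filter could be computed, assuming per-regime accuracies are available; the gap names, function names, and all numbers below are illustrative assumptions, not the authors' released code:

```python
# Hypothetical decomposition of DeR2-style regime gaps for one model.
REGIMES = ("instruction_only", "concepts", "related_only", "full_set")

def regime_gaps(acc: dict) -> dict:
    """Turn per-regime accuracies (values in [0, 1]) into interpretable gaps."""
    return {
        # Gain from gold concepts over no evidence at all:
        "reasoning_headroom": acc["concepts"] - acc["instruction_only"],
        # Cost of extracting concepts from documents instead of being given them:
        "extraction_loss": acc["concepts"] - acc["related_only"],
        # Cost attributable to topically related distractors (denoising):
        "retrieval_loss": acc["related_only"] - acc["full_set"],
        # Mode-switch fragility: worse WITH the full library than with none:
        "mode_switch_fragile": acc["full_set"] < acc["instruction_only"],
    }

def passes_two_phase_validation(solved_without_evidence: bool,
                                solved_with_gold_concepts: bool) -> bool:
    """Keep an instance only if it (1) fails parametrically, i.e. is not
    solvable from model memory alone, and (2) becomes solvable once
    oracle concepts are provided."""
    return (not solved_without_evidence) and solved_with_gold_concepts

# Example with made-up accuracies for a single model:
acc = {"instruction_only": 0.12, "concepts": 0.55,
       "related_only": 0.40, "full_set": 0.31}
print(regime_gaps(acc))
```

Under this sketch, a large `retrieval_loss` with a small `extraction_loss` would point at denoising failures rather than concept-use failures, which is the kind of fine-grained attribution the regime design is meant to enable.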

Community


arXivLens breakdown of this paper 👉 https://arxivlens.com/PaperView/Details/retrieval-infused-reasoning-sandbox-a-benchmark-for-decoupling-retrieval-and-reasoning-capabilities-6192-add5f3f6

  • Executive Summary
  • Detailed Breakdown
  • Practical Applications



Models citing this paper: 0

Datasets citing this paper: 0

Spaces citing this paper: 0

Collections including this paper 3