Paper page - Empty Shelves or Lost Keys? Recall Is the Bottleneck for Parametric Factuality

arxiv:2602.14080

Empty Shelves or Lost Keys? Recall Is the Bottleneck for Parametric Factuality

Published on Feb 15 · Submitted by Nitay Calderon on Feb 19
Authors: Nitay Calderon, Eyal Ben-David, Zorik Gekhman, Eran Ofek, Gal Yona

Abstract

LLMs demonstrate near-complete factual encoding but struggle with retrieval accessibility, where errors stem from access limitations rather than knowledge gaps, with reasoning improving recall of encoded information.

AI-generated summary

Standard factuality evaluations of LLMs treat all errors alike, obscuring whether failures arise from missing knowledge (empty shelves) or from limited access to encoded facts (lost keys). We propose a behavioral framework that profiles factual knowledge at the level of facts rather than questions, characterizing each fact by whether it is encoded, and then by how accessible it is: cannot be recalled, can be directly recalled, or can only be recalled with inference-time computation (thinking). To support such profiling, we introduce WikiProfile, a new benchmark constructed via an automated pipeline with a prompted LLM grounded in web search. Across 4 million responses from 13 LLMs, we find that encoding is nearly saturated in frontier models on our benchmark, with GPT-5 and Gemini-3 encoding 95–98% of facts. However, recall remains a major bottleneck: many errors previously attributed to missing knowledge instead stem from failures to access it. These failures are systematic and disproportionately affect long-tail facts and reverse questions. Finally, we show that thinking improves recall and can recover a substantial fraction of failures, indicating that future gains may rely less on scaling and more on methods that improve how models utilize what they already encode.
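The abstract's taxonomy can be summarized as a small decision procedure. The sketch below is a hypothetical illustration, not the paper's actual pipeline: it assumes three boolean probe outcomes per fact (whether the fact is encoded, whether it is answered correctly without thinking, and whether it is answered correctly with thinking) and maps them onto the profiles the framework describes.

```python
# Hypothetical sketch of the fact-profiling taxonomy described in the
# abstract. The probe outcomes (encoded / direct_correct / thinking_correct)
# are assumed inputs; how they are measured is the paper's contribution
# and is not reproduced here.
from enum import Enum


class Profile(Enum):
    NOT_ENCODED = "empty shelf"           # fact is not stored in the parameters
    DIRECT_RECALL = "direct recall"       # answered correctly without thinking
    THINKING_RECALL = "thinking recall"   # recovered only with inference-time computation
    ENCODED_NOT_RECALLED = "lost key"     # encoded, but inaccessible either way


def profile_fact(encoded: bool, direct_correct: bool, thinking_correct: bool) -> Profile:
    """Map three behavioral probe outcomes onto a knowledge profile."""
    if not encoded:
        return Profile.NOT_ENCODED
    if direct_correct:
        return Profile.DIRECT_RECALL
    if thinking_correct:
        return Profile.THINKING_RECALL
    return Profile.ENCODED_NOT_RECALLED
```

Under this framing, the paper's headline result is that the `NOT_ENCODED` bucket is nearly empty for frontier models, while the `THINKING_RECALL` and `ENCODED_NOT_RECALLED` buckets account for most remaining errors.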

Community

Paper submitter

Why do frontier LLMs make factual errors?
Is it because they never learned the fact…
or because they can’t access knowledge they already encoded?
This paper shows:
The bottleneck is not encoding; it is recall.

Paper submitter

[Figure: profile_dist]

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API:

* [Retrieval-Infused Reasoning Sandbox: A Benchmark for Decoupling Retrieval and Reasoning Capabilities](https://huggingface.co/papers/2601.21937) (2026)
* [Decomposed Prompting Does Not Fix Knowledge Gaps, But Helps Models Say "I Don't Know"](https://huggingface.co/papers/2602.04853) (2026)
* [FactNet: A Billion-Scale Knowledge Graph for Multilingual Factual Grounding](https://huggingface.co/papers/2602.03417) (2026)
* [Evaluating Contextually Mediated Factual Recall in Multilingual Large Language Models](https://huggingface.co/papers/2601.12555) (2026)
* [EverMemBench: Benchmarking Long-Term Interactive Memory in Large Language Models](https://huggingface.co/papers/2602.01313) (2026)
* [LiveNewsBench: Evaluating LLM Web Search Capabilities with Freshly Curated News](https://huggingface.co/papers/2602.13543) (2026)
* [TraceBack: Multi-Agent Decomposition for Fine-Grained Table Attribution](https://huggingface.co/papers/2602.13059) (2026)

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any paper on Hugging Face, check out [this Space](https://huggingface.co/spaces/librarian-bots/recommend_similar_papers).

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend


Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2602.14080 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2602.14080 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2602.14080 in a Space README.md to link it from this page.

Collections including this paper 1