Paper page - SE-Bench: Benchmarking Self-Evolution with Knowledge Internalization
Comment by librarian-bot (2026-02-10):

This is an automated message from the [Librarian Bot](https://huggingface.co/librarian-bots). I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API:

* [Knowledge is Not Enough: Injecting RL Skills for Continual Adaptation](https://huggingface.co/papers/2601.11258) (2026)
* [Mistake Notebook Learning: Batch-Clustered Failures for Training-Free Agent Adaptation](https://huggingface.co/papers/2512.11485) (2025)
* [CLEANER: Self-Purified Trajectories Boost Agentic Reinforcement Learning](https://huggingface.co/papers/2601.15141) (2026)
* [Dr. Zero: Self-Evolving Search Agents without Training Data](https://huggingface.co/papers/2601.07055) (2026)
* [Yunjue Agent Tech Report: A Fully Reproducible, Zero-Start In-Situ Self-Evolving Agent System for Open-Ended Tasks](https://huggingface.co/papers/2601.18226) (2026)
* [Self-Consolidation for Self-Evolving Agents](https://huggingface.co/papers/2602.01966) (2026)
* [DARC: Decoupled Asymmetric Reasoning Curriculum for LLM Evolution](https://huggingface.co/papers/2601.13761) (2026)

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any paper on Hugging Face, check out [this Space](https://huggingface.co/spaces/librarian-bots/recommend_similar_papers).

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: `@librarian-bot recommend`
SE-Bench: Benchmarking Self-Evolution with Knowledge Internalization

Authors: Jiarui Yuan, Tailin Jin, Weize Chen, Zeyuan Liu, Zhiyuan Liu, Maosong Sun
Published: 2026-02-04 (arXiv 2602.04811)
AI-generated summary

SE-Bench presents a diagnostic environment that obscures NumPy's API to evaluate agents' ability to internalize and use novel knowledge without external documentation, revealing challenges in knowledge retention and internalization across different training approaches.
Abstract

True self-evolution requires agents to act as lifelong learners that internalize novel experiences to solve future problems. However, rigorously measuring this foundational capability is hindered by two obstacles: the entanglement of prior knowledge, where "new" knowledge may already appear in pre-training data, and the entanglement of reasoning complexity, where failures may stem from problem difficulty rather than an inability to recall learned knowledge. We introduce SE-Bench, a diagnostic environment that obfuscates the NumPy library and its API documentation into a pseudo-novel package with randomized identifiers. Agents are trained to internalize this package and evaluated on simple coding tasks without access to documentation, yielding a clean setting where tasks are trivial with the new API documentation but impossible for base models without it. Our investigation reveals three insights: (1) the Open-Book Paradox, where training with reference documentation inhibits retention, requiring "Closed-Book Training" to force knowledge compression into the weights; (2) the RL Gap, where standard RL fails to fully internalize new knowledge due to PPO clipping and negative gradients; and (3) the viability of Self-Play for internalization, showing that models can learn from self-generated, noisy tasks when coupled with SFT, but not with RL. Overall, SE-Bench establishes a rigorous diagnostic platform for self-evolution with knowledge internalization. Our code and dataset can be found at https://github.com/thunlp/SE-Bench.
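To make the setup concrete, here is a minimal sketch of the kind of identifier obfuscation the abstract describes: remapping NumPy's public API onto randomized names so the resulting package behaves identically but is lexically novel to a pretrained model. This is an illustrative assumption, not the authors' released pipeline; the helper names (`random_identifier`, `build_obfuscation_map`) are hypothetical.

```python
import random
import string

import numpy as np

def random_identifier(rng, length=8):
    """Draw a random lowercase identifier, e.g. 'qzvmwkat'."""
    return "".join(rng.choice(string.ascii_lowercase) for _ in range(length))

def build_obfuscation_map(module, seed=0):
    """Map each public attribute of `module` to a fresh randomized name."""
    rng = random.Random(seed)
    public_names = [n for n in dir(module) if not n.startswith("_")]
    mapping, used = {}, set()
    for name in public_names:
        alias = random_identifier(rng)
        while alias in used:  # avoid the (unlikely) collision
            alias = random_identifier(rng)
        used.add(alias)
        mapping[name] = alias
    return mapping

# Build the pseudo-novel package: NumPy's behavior behind unfamiliar names.
mapping = build_obfuscation_map(np, seed=42)
pseudo_pkg = {alias: getattr(np, name) for name, alias in mapping.items()}

# A task that is trivial with the obfuscated API doc but impossible for a
# base model without it: recalling which alias stands for np.mean.
mean_alias = mapping["mean"]
print(f"{mean_alias}([1, 2, 3]) =", pseudo_pkg[mean_alias]([1, 2, 3]))
```

The "RL Gap" finding refers to the standard PPO clipped surrogate objective. For reference, a minimal PyTorch version of that objective in its generic textbook form (not code from the paper): once the probability ratio leaves the clip range [1 - eps, 1 + eps] on the clipped branch, the per-token gradient vanishes, which is the mechanism the abstract points to when explaining why standard RL under-internalizes new knowledge.

```python
import torch

def ppo_clipped_loss(logp_new, logp_old, advantages, eps=0.2):
    """Standard PPO clipped surrogate (to be minimized).

    Tokens whose ratio is clipped contribute zero gradient, limiting how
    much new knowledge gets pushed into the weights per update.
    """
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```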