Paper page - Video-R4: Reinforcing Text-Rich Video Reasoning with Visual Rumination
\n","updatedAt":"2025-11-25T01:35:30.554Z","author":{"_id":"63d3e0e8ff1384ce6c5dd17d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg","fullname":"Librarian Bot (Bot)","name":"librarian-bot","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":317,"isUserFollowing":false}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.703376293182373},"editors":["librarian-bot"],"editorAvatarUrls":["https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg"],"reactions":[],"isReport":false}}],"primaryEmailConfirmed":false,"paper":{"id":"2511.17490","authors":[{"_id":"6923c52fb5612535ed9558c0","name":"Yolo Yunlong Tang","hidden":false},{"_id":"6923c52fb5612535ed9558c1","name":"Daiki Shimada","hidden":false},{"_id":"6923c52fb5612535ed9558c2","user":{"_id":"639f8277beb95d698de007dd","avatarUrl":"/avatars/57f223ccd9d3cb03166ccf0e41361c58.svg","isPro":false,"fullname":"HangHua","user":"hhua2","type":"user"},"name":"Hang Hua","status":"claimed_verified","statusLastChangedAt":"2025-11-27T10:00:34.628Z","hidden":false},{"_id":"6923c52fb5612535ed9558c3","name":"Chao Huang","hidden":false},{"_id":"6923c52fb5612535ed9558c4","name":"Jing Bi","hidden":false},{"_id":"6923c52fb5612535ed9558c5","name":"Rogerio Feris","hidden":false},{"_id":"6923c52fb5612535ed9558c6","name":"Chenliang Xu","hidden":false}],"publishedAt":"2025-11-21T18:47:09.000Z","submittedOnDailyAt":"2025-11-24T00:09:06.782Z","title":"Video-R4: Reinforcing Text-Rich Video Reasoning with Visual Rumination","submittedOnDailyBy":{"_id":"6344c87f0f69ad8aa61dfcf6","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/6344c87f0f69ad8aa61dfcf6/tTVHu2l2aiAnK160vgT6u.jpeg","isPro":false,"fullname":"Yolo Y. Tang","user":"yunlong10","type":"user"},"summary":"Understanding text-rich videos requires reading small, transient textual cues that often demand repeated inspection. Yet most video QA models rely on single-pass perception over fixed frames, leading to hallucinations and failures on fine-grained evidence. Inspired by how humans pause, zoom, and re-read critical regions, we introduce Video-R4 (Reinforcing Text-Rich Video Reasoning with Visual Rumination), a video reasoning LMM that performs visual rumination: iteratively selecting frames, zooming into informative regions, re-encoding retrieved pixels, and updating its reasoning state. We construct two datasets with executable rumination trajectories: Video-R4-CoT-17k for supervised practice and Video-R4-RL-30k for reinforcement learning. We propose a multi-stage rumination learning framework that progressively finetunes a 7B LMM to learn atomic and mixing visual operations via SFT and GRPO-based RL. 
Video-R4-7B achieves state-of-the-art results on M4-ViteVQA and further generalizes to multi-page document QA, slides QA, and generic video QA, demonstrating that iterative rumination is an effective paradigm for pixel-grounded multimodal reasoning.","upvotes":22,"discussionId":"6923c52fb5612535ed9558c7","projectPage":"https://yunlong10.github.io/Video-R4/","githubRepo":"https://github.com/yunlong10/Video-R4","githubRepoAddedBy":"user","ai_summary":"Video-R4, a video reasoning LMM, uses iterative visual rumination to improve text-rich video QA by selecting, zooming, and re-encoding frames, achieving state-of-the-art results on various QA tasks.","ai_keywords":["video QA models","visual rumination","Video-R4","LMM","frames","re-encoding","reasoning state","Video-R4-CoT-17k","Video-R4-RL-30k","multi-stage rumination learning","SFT","GRPO-based RL","M4-ViteVQA","multi-page document QA","slides QA","generic video QA"],"githubStars":27},"canReadDatabase":false,"canManagePapers":false,"canSubmit":false,"hasHfLevelAccess":false,"upvoted":false,"upvoters":[{"_id":"6344c87f0f69ad8aa61dfcf6","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/6344c87f0f69ad8aa61dfcf6/tTVHu2l2aiAnK160vgT6u.jpeg","isPro":false,"fullname":"Yolo Y. Tang","user":"yunlong10","type":"user"},{"_id":"67257ee0938e718957c9c100","avatarUrl":"/avatars/db1f792ee1a5d860ea4e98c11a016a1b.svg","isPro":false,"fullname":"Chao Huang","user":"ChaoHuangCS","type":"user"},{"_id":"66805e103dd5f4c44c1c939a","avatarUrl":"/avatars/ef2eebf1d61f86dde3397471ca180b95.svg","isPro":false,"fullname":"Zeliang Zhang","user":"zeliang0426","type":"user"},{"_id":"65763434a4ee9a4fe7cfb156","avatarUrl":"/avatars/2c4a23ff309f750dd9c0d67bd9fd7abc.svg","isPro":false,"fullname":"Susan Liang","user":"susanliang","type":"user"},{"_id":"62eb469dade76f18dd4f0dea","avatarUrl":"/avatars/c558254a0352d115a73febb90bb9370f.svg","isPro":true,"fullname":"Pinxin Liu","user":"pliu23","type":"user"},{"_id":"636377270ec58fc3c0731af4","avatarUrl":"/avatars/39e8e4babb01724aceb6ec2bbb27a7d9.svg","isPro":false,"fullname":"Yizhi Song","user":"song630","type":"user"},{"_id":"654d82b8d2db4280d9351bc5","avatarUrl":"/avatars/bcf1a67ea1282bf2124b6c964c717232.svg","isPro":false,"fullname":"Xinyi Liu","user":"Xinyi125","type":"user"},{"_id":"64d4615cf8082bf19b916492","avatarUrl":"/avatars/8e1b59565ec5e4b31090cf1b911781b9.svg","isPro":false,"fullname":"wongyukim","user":"wongyukim","type":"user"},{"_id":"65f0eb3c4b0b4771dbde1a40","avatarUrl":"/avatars/3507afae5566aefe9dd187a8613e9fd3.svg","isPro":false,"fullname":"JunJiaGuo","user":"JunJiaGuo","type":"user"},{"_id":"679e5c508b9b6f902e3ec39b","avatarUrl":"/avatars/f496e504f28083d35ff2cf8b870db5a0.svg","isPro":false,"fullname":"lsong","user":"lsong111","type":"user"},{"_id":"643131d2faac480f27c1c3d1","avatarUrl":"/avatars/0413f950c961f36e20545f1bd5efed45.svg","isPro":false,"fullname":"vincent","user":"mississippiu","type":"user"},{"_id":"650d492c9169ea73317d867c","avatarUrl":"/avatars/e9cfdbe1a34ac63cfb3f8b4365eab104.svg","isPro":false,"fullname":"Yuexi Shen","user":"yuexishen","type":"user"}],"acceptLanguages":["*"],"dailyPaperRank":0}">
AI-generated summary

Video-R4, a video reasoning LMM, uses iterative visual rumination to improve text-rich video QA by selecting frames, zooming into informative regions, and re-encoding them, achieving state-of-the-art results across a range of QA tasks.

Abstract
Understanding text-rich videos requires reading small, transient textual cues that often demand repeated inspection. Yet most video QA models rely on single-pass perception over fixed frames, leading to hallucinations and failures on fine-grained evidence. Inspired by how humans pause, zoom, and re-read critical regions, we introduce Video-R4 (Reinforcing Text-Rich Video Reasoning with Visual Rumination), a video reasoning LMM that performs visual rumination: iteratively selecting frames, zooming into informative regions, re-encoding retrieved pixels, and updating its reasoning state. We construct two datasets with executable rumination trajectories: Video-R4-CoT-17k for supervised practice and Video-R4-RL-30k for reinforcement learning. We propose a multi-stage rumination learning framework that progressively finetunes a 7B LMM to learn atomic and mixing visual operations via SFT and GRPO-based RL. Video-R4-7B achieves state-of-the-art results on M4-ViteVQA and further generalizes to multi-page document QA, slides QA, and generic video QA, demonstrating that iterative rumination is an effective paradigm for pixel-grounded multimodal reasoning.
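The rumination loop described in the abstract (select frames, zoom into an informative region, re-encode the retrieved pixels, update the reasoning state, repeat) can be pictured roughly as follows. This is a minimal illustrative sketch, not the authors' implementation: the model interface and the helper names (select_frames, propose_zoom_region, crop_and_encode, update_state, ready_to_answer) are assumptions made for clarity.

```python
# Illustrative sketch of an iterative "visual rumination" loop for text-rich video QA.
# All helper names below are hypothetical; the real Video-R4 pipeline may differ.

from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ReasoningState:
    question: str
    evidence: List[str] = field(default_factory=list)  # visual/textual evidence gathered so far
    answer: Optional[str] = None


def ruminate(video_frames, question, lmm, max_steps=8):
    """Iteratively re-inspect a video until the model has enough evidence to answer."""
    state = ReasoningState(question=question)
    for _ in range(max_steps):
        # 1. Pick a small set of frames that look relevant to the question.
        frames = lmm.select_frames(video_frames, state)
        # 2. Propose a region to zoom into (e.g., a sign, subtitle, or slide text).
        region = lmm.propose_zoom_region(frames, state)
        # 3. Re-encode the zoomed-in crop at higher effective resolution.
        tokens = lmm.crop_and_encode(frames, region)
        # 4. Fold the new visual evidence into the reasoning state.
        state = lmm.update_state(state, tokens)
        # 5. Stop once the model decides it has seen enough.
        if lmm.ready_to_answer(state):
            break
    state.answer = lmm.answer(state)
    return state.answer
```

The point the abstract emphasizes is that pixels are re-encoded inside the loop rather than perceived once up front, which is what allows small, transient text to be re-read before answering.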
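The abstract also mentions GRPO-based RL for the later training stages. For reference, the standard GRPO formulation normalizes rewards within a group of sampled rollouts; the concrete reward design used here (e.g., answer correctness plus trajectory-format checks) is an assumption, as the abstract does not specify it.

```latex
% Group-relative advantage for G sampled rumination trajectories o_1, ..., o_G per question q,
% with scalar rewards r_1, ..., r_G:
\hat{A}_i = \frac{r_i - \mathrm{mean}(r_1,\dots,r_G)}{\mathrm{std}(r_1,\dots,r_G)},
\qquad
\rho_i = \frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\mathrm{old}}}(o_i \mid q)},
\qquad
\mathcal{J}(\theta) = \mathbb{E}\!\left[\frac{1}{G}\sum_{i=1}^{G}
  \min\!\big(\rho_i \hat{A}_i,\ \mathrm{clip}(\rho_i, 1-\epsilon, 1+\epsilon)\,\hat{A}_i\big)
  - \beta\, D_{\mathrm{KL}}\!\big(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\big)\right]
```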