Paper page - RynnBrain: Open Embodied Foundation Models
\n","updatedAt":"2026-02-19T14:21:17.910Z","author":{"_id":"65243980050781c16f234f1f","avatarUrl":"/avatars/743a009681d5d554c27e04300db9f267.svg","fullname":"Avi","name":"avahal","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":3,"isUserFollowing":false}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.7936084270477295},"editors":["avahal"],"editorAvatarUrls":["/avatars/743a009681d5d554c27e04300db9f267.svg"],"reactions":[],"isReport":false}},{"id":"6997bb79318f3c86ddf49303","author":{"_id":"63d3e0e8ff1384ce6c5dd17d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg","fullname":"Librarian Bot (Bot)","name":"librarian-bot","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":318,"isUserFollowing":false},"createdAt":"2026-02-20T01:40:09.000Z","type":"comment","data":{"edited":false,"hidden":false,"latest":{"raw":"This is an automated message from the [Librarian Bot](https://huggingface.co/librarian-bots). I found the following papers similar to this paper. \n\nThe following papers were recommended by the Semantic Scholar API \n\n* [RoboBrain 2.5: Depth in Sight, Time in Mind](https://huggingface.co/papers/2601.14352) (2026)\n* [DM0: An Embodied-Native Vision-Language-Action Model towards Physical AI](https://huggingface.co/papers/2602.14974) (2026)\n* [Thinker: A vision-language foundation model for embodied intelligence](https://huggingface.co/papers/2601.21199) (2026)\n* [Unified Embodied VLM Reasoning with Robotic Action via Autoregressive Discretized Pre-training](https://huggingface.co/papers/2512.24125) (2025)\n* [Being-H0.5: Scaling Human-Centric Robot Learning for Cross-Embodiment Generalization](https://huggingface.co/papers/2601.12993) (2026)\n* [Action-Sketcher: From Reasoning to Action via Visual Sketches for Long-Horizon Robotic Manipulation](https://huggingface.co/papers/2601.01618) (2026)\n* [ABot-N0: Technical Report on the VLA Foundation Model for Versatile Embodied Navigation](https://huggingface.co/papers/2602.11598) (2026)\n\n\n Please give a thumbs up to this comment if you found it helpful!\n\n If you want recommendations for any Paper on Hugging Face checkout [this](https://huggingface.co/spaces/librarian-bots/recommend_similar_papers) Space\n\n You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: `@librarian-bot recommend`","html":"
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
\n
The following papers were recommended by the Semantic Scholar API
Please give a thumbs up to this comment if you found it helpful!
\n
If you want recommendations for any Paper on Hugging Face checkout this Space
\n
You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: \n\n@librarian-bot\n\t recommend
\n","updatedAt":"2026-02-20T01:40:09.790Z","author":{"_id":"63d3e0e8ff1384ce6c5dd17d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg","fullname":"Librarian Bot (Bot)","name":"librarian-bot","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":318,"isUserFollowing":false}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.7622317671775818},"editors":["librarian-bot"],"editorAvatarUrls":["https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg"],"reactions":[],"isReport":false}}],"primaryEmailConfirmed":false,"paper":{"id":"2602.14979","authors":[{"_id":"69968e1e1268a6b79e0d02f1","user":{"_id":"64731a68a7f23affe7736d3d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/noauth/8wESKLFcS2ltPzL-wpG4Z.jpeg","isPro":false,"fullname":"Ronghao Dang","user":"RH-Dang","type":"user"},"name":"Ronghao Dang","status":"admin_assigned","statusLastChangedAt":"2026-02-19T10:00:05.563Z","hidden":false},{"_id":"69968e1e1268a6b79e0d02f2","user":{"_id":"66224557c61c7fbd98099079","avatarUrl":"/avatars/a4f2144585c808865c73b5b7f0087c1f.svg","isPro":false,"fullname":"Jiayan Guo","user":"SpaceProduct","type":"user"},"name":"Jiayan Guo","status":"admin_assigned","statusLastChangedAt":"2026-02-19T10:00:12.888Z","hidden":false},{"_id":"69968e1e1268a6b79e0d02f3","name":"Bohan Hou","hidden":false},{"_id":"69968e1e1268a6b79e0d02f4","user":{"_id":"609115c79a8bcaa437b234a9","avatarUrl":"/avatars/1631a91030703d8397133363cf82c863.svg","isPro":false,"fullname":"Leng Sicong","user":"Sicong","type":"user"},"name":"Sicong Leng","status":"admin_assigned","statusLastChangedAt":"2026-02-19T10:00:23.420Z","hidden":false},{"_id":"69968e1e1268a6b79e0d02f5","user":{"_id":"6388af095a3d2a335622cb7c","avatarUrl":"/avatars/f548ce6a902cee8bdc74179bcd45534c.svg","isPro":false,"fullname":"Kehan Li","user":"lkhl","type":"user"},"name":"Kehan Li","status":"claimed_verified","statusLastChangedAt":"2026-02-19T09:51:43.746Z","hidden":false},{"_id":"69968e1e1268a6b79e0d02f6","name":"Xin Li","hidden":false},{"_id":"69968e1e1268a6b79e0d02f7","user":{"_id":"67a6082b2f32323bfb5e6641","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/no-auth/NZvjm4Kapc7AgReexgRgi.png","isPro":false,"fullname":"jiangpin","user":"jiangpinliu","type":"user"},"name":"Jiangpin Liu","status":"admin_assigned","statusLastChangedAt":"2026-02-19T10:00:30.219Z","hidden":false},{"_id":"69968e1e1268a6b79e0d02f8","user":{"_id":"67fcc97cede5c434e0cc37e3","avatarUrl":"/avatars/b07e0a4744c1045828a621146ee6d3c2.svg","isPro":false,"fullname":"yunxuan mao","user":"maoyunxuan","type":"user"},"name":"Yunxuan Mao","status":"admin_assigned","statusLastChangedAt":"2026-02-19T10:00:35.800Z","hidden":false},{"_id":"69968e1e1268a6b79e0d02f9","name":"Zhikai Wang","hidden":false},{"_id":"69968e1e1268a6b79e0d02fa","name":"Yuqian Yuan","hidden":false},{"_id":"69968e1e1268a6b79e0d02fb","user":{"_id":"678390f0cf658d598cdd169c","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/no-auth/RIRRdQsgtr1TdspMmOz1w.png","isPro":false,"fullname":"Minghao Zhu","user":"ZMHH-H","type":"user"},"name":"Minghao Zhu","status":"claimed_verified","statusLastChangedAt":"2026-02-20T08:37:28.674Z","hidden":false},{"_id":"69968e1e1268a6b79e0d02fc","user":{"_id":"667e2e4ebfbdb7d21df57084","avatarUrl":"/avatars/dfc6e9e2805c6f22773495c4f02399b8.svg","isPro":false,"fullname":"Xiao 
Lin","user":"Chrislin21","type":"user"},"name":"Xiao Lin","status":"claimed_verified","statusLastChangedAt":"2026-02-19T11:48:50.334Z","hidden":false},{"_id":"69968e1e1268a6b79e0d02fd","name":"Yang Bai","hidden":false},{"_id":"69968e1e1268a6b79e0d02fe","name":"Qian Jiang","hidden":false},{"_id":"69968e1e1268a6b79e0d02ff","name":"Yaxi Zhao","hidden":false},{"_id":"69968e1e1268a6b79e0d0300","name":"Minghua Zeng","hidden":false},{"_id":"69968e1e1268a6b79e0d0301","user":{"_id":"64bbe6904d2052b1aaf4f2d7","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/64bbe6904d2052b1aaf4f2d7/g0A43kSoy2Wba5JZ2Hcry.jpeg","isPro":false,"fullname":"junlong gao","user":"jlgao23","type":"user"},"name":"Junlong Gao","status":"admin_assigned","statusLastChangedAt":"2026-02-19T10:01:11.266Z","hidden":false},{"_id":"69968e1e1268a6b79e0d0302","name":"Yuming Jiang","hidden":false},{"_id":"69968e1e1268a6b79e0d0303","name":"Jun Cen","hidden":false},{"_id":"69968e1e1268a6b79e0d0304","user":{"_id":"65fd82762bf2cd20ddaa193f","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/noauth/yBYbWp_mT7UusYdkqtAvw.png","isPro":false,"fullname":"Siteng Huang","user":"huangsiteng","type":"user"},"name":"Siteng Huang","status":"claimed_verified","statusLastChangedAt":"2026-02-19T09:51:46.265Z","hidden":false},{"_id":"69968e1e1268a6b79e0d0305","name":"Liuyi Wang","hidden":false},{"_id":"69968e1e1268a6b79e0d0306","name":"Wenqiao Zhang","hidden":false},{"_id":"69968e1e1268a6b79e0d0307","name":"Chengju Liu","hidden":false},{"_id":"69968e1e1268a6b79e0d0308","name":"Jianfei Yang","hidden":false},{"_id":"69968e1e1268a6b79e0d0309","name":"Shijian Lu","hidden":false},{"_id":"69968e1e1268a6b79e0d030a","name":"Deli Zhao","hidden":false}],"publishedAt":"2026-02-13T18:59:56.000Z","submittedOnDailyAt":"2026-02-19T02:00:43.226Z","title":"RynnBrain: Open Embodied Foundation Models","submittedOnDailyBy":{"_id":"609115c79a8bcaa437b234a9","avatarUrl":"/avatars/1631a91030703d8397133363cf82c863.svg","isPro":false,"fullname":"Leng Sicong","user":"Sicong","type":"user"},"summary":"Despite rapid progress in multimodal foundation models, embodied intelligence community still lacks a unified, physically grounded foundation model that integrates perception, reasoning, and planning within real-world spatial-temporal dynamics. We introduce RynnBrain, an open-source spatiotemporal foundation model for embodied intelligence. RynnBrain strengthens four core capabilities in a unified framework: comprehensive egocentric understanding, diverse spatiotemporal localization, physically grounded reasoning, and physics-aware planning. The RynnBrain family comprises three foundation model scales (2B, 8B, and 30B-A3B MoE) and four post-trained variants tailored for downstream embodied tasks (i.e., RynnBrain-Nav, RynnBrain-Plan, and RynnBrain-VLA) or complex spatial reasoning tasks (i.e., RynnBrain-CoP). In terms of extensive evaluations on 20 embodied benchmarks and 8 general vision understanding benchmarks, our RynnBrain foundation models largely outperform existing embodied foundation models by a significant margin. 
The post-trained model suite further substantiates two key potentials of the RynnBrain foundation model: (i) enabling physically grounded reasoning and planning, and (ii) serving as a strong pretrained backbone that can be efficiently adapted to diverse embodied tasks.","upvotes":32,"discussionId":"69968e1e1268a6b79e0d030b","projectPage":"https://alibaba-damo-academy.github.io/RynnBrain.github.io/","githubRepo":"https://github.com/alibaba-damo-academy/RynnBrain","githubRepoAddedBy":"user","ai_summary":"RynnBrain is an open-source spatiotemporal foundation model for embodied intelligence that unifies perception, reasoning, and planning capabilities across multiple scales and task-specific variants.","ai_keywords":["multimodal foundation models","embodied intelligence","spatiotemporal foundation model","egocentric understanding","spatiotemporal localization","physically grounded reasoning","physics-aware planning","MoE","post-trained variants","embodied foundation models","vision understanding benchmarks"],"githubStars":414,"organization":{"_id":"6808e7522a4d69d5111da55f","name":"Alibaba-DAMO-Academy","fullname":"DAMO Academy","avatar":"https://cdn-uploads.huggingface.co/production/uploads/6808e64de5dd22427c006e10/9J3vdB62CdeTOd_YrGh9w.jpeg"}},"canReadDatabase":false,"canManagePapers":false,"canSubmit":false,"hasHfLevelAccess":false,"upvoted":false,"upvoters":[{"_id":"609115c79a8bcaa437b234a9","avatarUrl":"/avatars/1631a91030703d8397133363cf82c863.svg","isPro":false,"fullname":"Leng Sicong","user":"Sicong","type":"user"},{"_id":"63c1699e40a26dd2db32400d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/63c1699e40a26dd2db32400d/3N0-Zp8igv8-52mXAdiiq.jpeg","isPro":false,"fullname":"Chroma","user":"Chroma111","type":"user"},{"_id":"63913b120cf6b11c487ca31d","avatarUrl":"/avatars/aec44edd5470dd6e767e0a25efd6fb5d.svg","isPro":true,"fullname":"Xin Li","user":"lixin4ever","type":"user"},{"_id":"65fd82762bf2cd20ddaa193f","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/noauth/yBYbWp_mT7UusYdkqtAvw.png","isPro":false,"fullname":"Siteng Huang","user":"huangsiteng","type":"user"},{"_id":"64d511b502e58cc1fdc49c64","avatarUrl":"/avatars/ac199ba6fd9eb4a0b16c5bc2a6ac8f03.svg","isPro":false,"fullname":"zhikai wang","user":"cloudcatcher","type":"user"},{"_id":"6342796a0875f2c99cfd313b","avatarUrl":"/avatars/98575092404c4197b20c929a6499a015.svg","isPro":false,"fullname":"Yuseung \"Phillip\" Lee","user":"phillipinseoul","type":"user"},{"_id":"6388af095a3d2a335622cb7c","avatarUrl":"/avatars/f548ce6a902cee8bdc74179bcd45534c.svg","isPro":false,"fullname":"Kehan Li","user":"lkhl","type":"user"},{"_id":"67a6082b2f32323bfb5e6641","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/no-auth/NZvjm4Kapc7AgReexgRgi.png","isPro":false,"fullname":"jiangpin","user":"jiangpinliu","type":"user"},{"_id":"65bb837dbfb878f46c77de4c","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/65bb837dbfb878f46c77de4c/23gZ_lBEwyoqjexFy9QLD.jpeg","isPro":true,"fullname":"Prithiv Sakthi","user":"prithivMLmods","type":"user"},{"_id":"620783f24e28382272337ba4","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/620783f24e28382272337ba4/zkUveQPNiDfYjgGhuFErj.jpeg","isPro":false,"fullname":"GuoLiangTang","user":"Tommy930","type":"user"},{"_id":"6996c3577253d482ed50bc96","avatarUrl":"/avatars/0311a89119ef4c6449a45d47d480b618.svg","isPro":false,"fullname":"Yuesong 
Wang","user":"EigenBuffalo","type":"user"},{"_id":"6984dfbc7c18014a36e846ff","avatarUrl":"/avatars/ae61f73399ff80dfd5b7c7066dd9e76d.svg","isPro":false,"fullname":"Fuyu Zhang","user":"agrafic","type":"user"}],"acceptLanguages":["*"],"dailyPaperRank":3,"organization":{"_id":"6808e7522a4d69d5111da55f","name":"Alibaba-DAMO-Academy","fullname":"DAMO Academy","avatar":"https://cdn-uploads.huggingface.co/production/uploads/6808e64de5dd22427c006e10/9J3vdB62CdeTOd_YrGh9w.jpeg"}}">
AI-generated summary: RynnBrain is an open-source spatiotemporal foundation model for embodied intelligence that unifies perception, reasoning, and planning capabilities across multiple scales and task-specific variants.

Despite rapid progress in multimodal foundation models, the embodied intelligence community still lacks a unified, physically grounded foundation model that integrates perception, reasoning, and planning within real-world spatial-temporal dynamics. We introduce RynnBrain, an open-source spatiotemporal foundation model for embodied intelligence. RynnBrain strengthens four core capabilities in a unified framework: comprehensive egocentric understanding, diverse spatiotemporal localization, physically grounded reasoning, and physics-aware planning. The RynnBrain family comprises three foundation model scales (2B, 8B, and 30B-A3B MoE) and four post-trained variants tailored for downstream embodied tasks (i.e., RynnBrain-Nav, RynnBrain-Plan, and RynnBrain-VLA) or complex spatial reasoning (i.e., RynnBrain-CoP). In extensive evaluations on 20 embodied benchmarks and 8 general vision understanding benchmarks, the RynnBrain foundation models outperform existing embodied foundation models by a significant margin. The post-trained model suite further substantiates two key strengths of the RynnBrain foundation model: (i) enabling physically grounded reasoning and planning, and (ii) serving as a strong pretrained backbone that can be efficiently adapted to diverse embodied tasks.
🚀 We’re excited to release our paper and fully open-source RynnBrain — an embodied foundation model designed as a unified cognitive brain for real-world agents.
Unlike conventional VLMs that reason purely in text or static images, RynnBrain is explicitly grounded in physical space and time, integrating egocentric perception, spatiotemporal memory, physically grounded reasoning, and physics-aware planning in a single model.
🧠 What’s fundamentally new? RynnBrain introduces a spatiotemporal foundation model for embodied intelligence, where reasoning is no longer detached from the physical world:
• Agents can remember object locations across time, not just within a single frame
• Reasoning is interleaved with spatial grounding (text ⇄ coordinates), reducing hallucination (a minimal parsing sketch follows this list)
• Planning outputs are directly executable, with objects, areas, affordances, and trajectories grounded in space
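To make the interleaved text-coordinate idea concrete, here is a minimal sketch of how a downstream consumer might parse a reasoning trace that embeds grounded points. The `<point x="..." y="..."/>` tag format, the function names, and the example trace are hypothetical illustrations, not RynnBrain's actual output schema.

```python
# Minimal sketch: parse reasoning text interleaved with spatial grounding.
# The <point x="..." y="..."/> tag format is a hypothetical illustration,
# not the actual RynnBrain output schema.
import re
from typing import List, Tuple

POINT_TAG = re.compile(
    r'<point\s+x="(?P<x>\d+(?:\.\d+)?)"\s+y="(?P<y>\d+(?:\.\d+)?)"\s*/?>'
)

def extract_grounded_points(reasoning: str) -> List[Tuple[float, float]]:
    """Pull every (x, y) coordinate referenced inside a reasoning trace."""
    return [(float(m.group("x")), float(m.group("y"))) for m in POINT_TAG.finditer(reasoning)]

def strip_point_tags(reasoning: str) -> str:
    """Return the plain-text chain of thought with coordinate tags removed."""
    return POINT_TAG.sub("", reasoning).strip()

if __name__ == "__main__":
    trace = (
        'The mug was last seen on the counter <point x="412" y="203"/> two frames ago; '
        'move toward it before reaching for the handle <point x="430" y="188"/>.'
    )
    print(extract_grounded_points(trace))  # [(412.0, 203.0), (430.0, 188.0)]
    print(strip_point_tags(trace))
```

Keeping the coordinates machine-readable while leaving the prose intact is what lets a planner or controller consume the same trace the model uses for reasoning.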
📈 Scale & training We pretrain RynnBrain on ~20M high-quality embodied training pairs, spanning object cognition, spatial reasoning, grounding, trajectory prediction, and manipulation planning. To make this feasible, we introduce RynnScale, a load-balanced spatiotemporal training framework that improves training efficiency by ~2× under the same compute budget, while preserving stability across dense and MoE models.
🏆 Strong empirical results
Across 20 embodied benchmarks and 8 general vision benchmarks, RynnBrain consistently outperforms existing embodied foundation models:
• Large gains in spatial reasoning, egocentric cognition, and fine-grained localization
• Competitive or superior performance relative to strong proprietary systems (e.g., Gemini Robotics-ER-style models) under comparable settings
Post-trained variants further validate its versatility:
• RynnBrain-CoP: physically grounded chain-of-point reasoning
• RynnBrain-Nav: SOTA results on VLN benchmarks (R2R, RxR)
• RynnBrain-Plan: spatially explicit manipulation planning
• RynnBrain-VLA: stronger downstream vision-language-action execution
📊 We also introduce RynnBrain-Bench
Existing benchmarks fail to evaluate long-horizon spatiotemporal grounding. RynnBrain-Bench fills this gap with 21 fine-grained embodied capabilities, covering object cognition, spatial cognition, grounding, and pointing across full episodic memory.
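As a rough illustration of how a pointing-style capability in such a benchmark is commonly scored, the sketch below checks whether a predicted point falls inside a ground-truth region. The record fields and the point-in-box criterion are assumptions for illustration, not the actual RynnBrain-Bench schema or metric.

```python
# Hedged sketch of a pointing-style evaluation loop.
# Fields and scoring rule are hypothetical, not the RynnBrain-Bench spec.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class PointingExample:
    question: str
    gt_box: Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)

def point_in_box(point: Tuple[float, float], box: Tuple[float, float, float, float]) -> bool:
    """A prediction counts as a hit if it lands inside the ground-truth region."""
    x, y = point
    x0, y0, x1, y1 = box
    return x0 <= x <= x1 and y0 <= y <= y1

def pointing_accuracy(examples: List[PointingExample],
                      predictions: List[Tuple[float, float]]) -> float:
    hits = sum(point_in_box(p, ex.gt_box) for ex, p in zip(examples, predictions))
    return hits / max(len(examples), 1)

if __name__ == "__main__":
    examples = [PointingExample("Where did you last see the keys?", (100, 40, 180, 120))]
    print(pointing_accuracy(examples, [(150.0, 90.0)]))  # 1.0
```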
📦 Fully open-source release
• Model code & checkpoints (2B, 8B, MoE 30B-A3B); a minimal loading sketch follows this list
• Complete training & fine-tuning framework
• RynnBrain-Bench benchmark suite
• Recipes for navigation, planning, and action workflows
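If the released checkpoints follow the usual Hugging Face packaging, a text-only smoke test might look like the sketch below. The repository ID is a placeholder and the Auto-class loading path is an assumption; consult the GitHub repo's README for the supported loading route and the multimodal (image/video) input pipeline.

```python
# Hedged sketch: load a released checkpoint for a text-only smoke test.
# The repo ID is a placeholder and the Auto-class route is an assumption;
# follow the official README for the actual multimodal usage.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "Alibaba-DAMO-Academy/RynnBrain-8B"  # hypothetical repository ID

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

prompt = "You are a household robot. The mug is on the counter. What should you do next?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```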
🔮 Why this matters
Embodied intelligence needs more than language fluency: it needs memory, spatial grounding, and physical consistency. RynnBrain is a step toward physically grounded general intelligence, offering a reproducible, extensible foundation for agents that perceive, remember, reason, and act in the real world.