
Papers
arxiv:2510.04673

Watch and Learn: Learning to Use Computers from Online Videos

Published on Oct 6, 2025
Submitted by taesiri on Oct 7, 2025
Authors:
Chan Hee Song, Yiwen Song, Palash Goyal, Yu Su, Oriana Riva, Hamid Palangi, Tomas Pfister

Abstract

AI-generated summary: Watch & Learn converts web demonstration videos into UI trajectories to enhance computer use agents, improving both in-context demonstrations and supervised training.

Computer use agents (CUAs) need to plan task workflows grounded in diverse, ever-changing applications and environments, but learning is hindered by the scarcity of large-scale, high-quality training data in the target application. Existing datasets are domain-specific, static, and costly to annotate, while current synthetic data generation methods often yield simplistic or misaligned task demonstrations. To address these limitations, we introduce Watch & Learn (W&L), a framework that converts human demonstration videos readily available on the Internet into executable UI trajectories at scale. Instead of directly generating trajectories or relying on ad hoc reasoning heuristics, we cast the problem as an inverse dynamics objective: predicting the user's action from consecutive screen states. This formulation reduces manual engineering, is easier to learn, and generalizes more robustly across applications. Concretely, we develop an inverse dynamics labeling pipeline with task-aware video retrieval, generate over 53k high-quality trajectories from raw web videos, and demonstrate that these trajectories improve CUAs both as in-context demonstrations and as supervised training data. On the challenging OSWorld benchmark, UI trajectories extracted with W&L consistently enhance both general-purpose and state-of-the-art frameworks in-context, and deliver stronger gains for open-source models under supervised training. These results highlight web-scale human demonstration videos as a practical and scalable foundation for advancing CUAs towards real-world deployment.
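
The central formulation in the abstract, predicting the user's action from two consecutive screen states, is concrete enough to sketch. The code below is a minimal, hypothetical illustration of such an inverse-dynamics labeling loop, not the paper's pipeline: `predict_action` stands in for whatever inverse dynamics model performs the labeling, and the frame-sampling rate and action schema are assumptions.

```python
# Minimal sketch of inverse-dynamics labeling for screen-recording videos.
# This illustrates the formulation described in the abstract, not the paper's
# actual pipeline. `predict_action` is a placeholder for an inverse dynamics
# model; the frame rate and action schema below are assumptions.
from dataclasses import dataclass
from typing import Callable, List, Optional

import cv2  # pip install opencv-python


@dataclass
class Step:
    """One labeled transition: the action inferred between two screen states."""
    frame_before: object   # decoded video frame (numpy array)
    frame_after: object
    action: dict           # e.g. {"type": "click", "x": 412, "y": 96} (assumed schema)


def sample_frames(video_path: str, fps_target: float = 2.0) -> List:
    """Decode a screen recording and keep roughly `fps_target` frames per second."""
    cap = cv2.VideoCapture(video_path)
    native_fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
    stride = max(1, round(native_fps / fps_target))
    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % stride == 0:
            frames.append(frame)
        idx += 1
    cap.release()
    return frames


def label_trajectory(
    video_path: str,
    predict_action: Callable[[object, object], Optional[dict]],
) -> List[Step]:
    """Turn a raw demonstration video into a (state, action, next state) trajectory.

    `predict_action(before, after)` plays the role of the inverse dynamics model:
    it looks at two consecutive screen states and returns the UI action between
    them, or None if the screen did not meaningfully change.
    """
    frames = sample_frames(video_path)
    trajectory: List[Step] = []
    for before, after in zip(frames, frames[1:]):
        action = predict_action(before, after)
        if action is not None:
            trajectory.append(Step(before, after, action))
    return trajectory
```

Casting labeling this way keeps each prediction local to a single state transition, which is what the abstract credits for making the problem easier to learn and more robust across applications than generating whole trajectories directly.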

Community


This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any paper on Hugging Face, check out this Space

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend


Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2510.04673 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2510.04673 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2510.04673 in a Space README.md to link it from this page.

Collections including this paper 1