ICAL: Continual Learning of Multimodal Agents by Transforming Trajectories into Actionable Insights
Gabriel Sarch, Lawrence Jang, Michael J. Tarr, William W. Cohen, Kenneth Marino, Katerina Fragkiadaki
Published on Jun 20, 2024 · arXiv:2406.14596
Abstract
AI-generated summary: In-Context Abstraction Learning (ICAL) improves decision-making in LLM and VLM agents by generating high-quality prompt examples from sub-optimal demonstrations and human feedback.
Large-scale generative language and vision-language models (LLMs and VLMs) excel in few-shot in-context learning for decision making and instruction following. However, they require high-quality exemplar demonstrations to be included in their context window. In this work, we ask: Can LLMs and VLMs generate their own prompt examples from generic, sub-optimal demonstrations? We propose In-Context Abstraction Learning (ICAL), a method that builds a memory of multimodal experience insights from sub-optimal demonstrations and human feedback. Given a noisy demonstration in a new domain, VLMs abstract the trajectory into a general program by fixing inefficient actions and annotating cognitive abstractions: task relationships, object state changes, temporal subgoals, and task construals. These abstractions are refined and adapted interactively through human feedback while the agent attempts to execute the trajectory in a similar environment. The resulting abstractions, when used as exemplars in the prompt, significantly improve decision-making in retrieval-augmented LLM and VLM agents. Our ICAL agent surpasses the state-of-the-art in dialogue-based instruction following in TEACh, multimodal web agents in VisualWebArena, and action anticipation in Ego4D. In TEACh, we achieve a 12.6% improvement in goal-condition success. In VisualWebArena, our task success rate improves over the SOTA from 14.3% to 22.7%. In Ego4D action forecasting, we improve over few-shot GPT-4V and remain competitive with supervised models. We show finetuning our retrieval-augmented in-context agent yields additional improvements. Our approach significantly reduces reliance on expert-crafted examples and consistently outperforms in-context learning from action plans that lack such insights.
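The abstract describes a concrete loop: abstract a noisy demonstration into an annotated exemplar, refine it through human feedback while the agent executes in a similar environment, and store it in a retrieval memory that supplies in-context examples at inference time. The Python sketch below is only an illustration of that loop under assumed interfaces; every name here (`Exemplar`, `ExemplarMemory`, `vlm_abstract`, `refine_with_human_feedback`, `ical_learn`) is hypothetical and not taken from the authors' code.

```python
from dataclasses import dataclass, field


@dataclass
class Exemplar:
    # An abstracted trajectory plus the cognitive annotations named in the abstract.
    program: str
    task_relationships: list[str]
    object_state_changes: list[str]
    temporal_subgoals: list[str]
    task_construals: list[str]


@dataclass
class ExemplarMemory:
    exemplars: list[Exemplar] = field(default_factory=list)

    def add(self, exemplar: Exemplar) -> None:
        self.exemplars.append(exemplar)

    def retrieve(self, task: str, k: int = 3) -> list[Exemplar]:
        # Placeholder retrieval: a real system would embed `task` and
        # return the k most similar stored exemplars for the prompt.
        return self.exemplars[:k]


def vlm_abstract(noisy_demo: str) -> Exemplar:
    # Placeholder for the VLM call that fixes inefficient actions and
    # annotates the trajectory with cognitive abstractions.
    return Exemplar(
        program=noisy_demo,
        task_relationships=[],
        object_state_changes=[],
        temporal_subgoals=[],
        task_construals=[],
    )


def refine_with_human_feedback(exemplar: Exemplar, feedback: str) -> Exemplar:
    # Placeholder: the agent would execute the program in a similar
    # environment and fold human corrections back into the exemplar.
    exemplar.task_construals.append(f"feedback: {feedback}")
    return exemplar


def ical_learn(noisy_demos: list[str], memory: ExemplarMemory) -> ExemplarMemory:
    # One pass over sub-optimal demonstrations: abstract, refine, store.
    for demo in noisy_demos:
        exemplar = vlm_abstract(demo)
        exemplar = refine_with_human_feedback(exemplar, feedback="looks correct")
        memory.add(exemplar)
    return memory


if __name__ == "__main__":
    memory = ical_learn(["pick up mug; place mug in sink"], ExemplarMemory())
    print(memory.retrieve("wash the mug in the sink"))
```

In a full system, `vlm_abstract` and `retrieve` would call a VLM and an embedding model respectively; they are stubs here only to show the data flow from noisy demonstration to retrievable exemplar.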
Community
Kudos @gsarch and team. I've featured this paper in my AI research newsletter https://www.aitidbits.ai/p/july-4th-2024#:~:text=Multimodal-,CMU,-and%20Google%20propose
Looking forward to more novel papers and methods.