arxiv:2402.13232

A Touch, Vision, and Language Dataset for Multimodal Alignment

Published on Feb 20, 2024 · Submitted by AK on Feb 21, 2024
Authors: Letian Fu, Gaurav Datta, Huang Huang, William Chung-Ho Panitch, Jaimyn Drake, Joseph Ortiz, Mustafa Mukadam, Mike Lambeta, Roberto Calandra, Ken Goldberg

Abstract

AI-generated summary: The study presents a new dataset and multimodal model that integrates touch, vision, and language, showing improvements in touch-vision-language alignment and visual-tactile understanding.

Touch is an important sensing modality for humans, but it has not yet been incorporated into a multimodal generative language model. This is partially due to the difficulty of obtaining natural language labels for tactile data and the complexity of aligning tactile readings with both visual observations and language descriptions. As a step towards bridging that gap, this work introduces a new dataset of 44K in-the-wild vision-touch pairs, with English language labels annotated by humans (10%) and textual pseudo-labels from GPT-4V (90%). We use this dataset to train a vision-language-aligned tactile encoder for open-vocabulary classification and a touch-vision-language (TVL) model for text generation using the trained encoder. Results suggest that by incorporating touch, the TVL model improves (+29% classification accuracy) touch-vision-language alignment over existing models trained on any pair of those modalities. Although only a small fraction of the dataset is human-labeled, the TVL model demonstrates improved visual-tactile understanding over GPT-4V (+12%) and open-source vision-language models (+32%) on a new touch-vision understanding benchmark. Code and data: https://tactile-vlm.github.io.
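To make the open-vocabulary classification setup concrete, below is a minimal, hypothetical sketch of how a tactile encoder aligned to a CLIP-style vision-language embedding space can classify a tactile frame by ranking text prompts with cosine similarity. This is not the authors' released code: the `TactileEncoder` architecture, input size, and prompt embeddings are placeholder assumptions, and the actual TVL encoder and training recipe are described in the paper and project page (https://tactile-vlm.github.io).

```python
# Minimal sketch of open-vocabulary tactile classification in a shared
# vision-language embedding space. Everything here (encoder architecture,
# input size, class prompts) is a placeholder assumption for illustration,
# not the TVL authors' implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F


class TactileEncoder(nn.Module):
    """Hypothetical tactile encoder that maps a tactile frame into the same
    embedding space as a CLIP-style text encoder (unit-normalized vectors)."""

    def __init__(self, embed_dim: int = 512):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=7, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=5, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, embed_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Unit-normalize so cosine similarity reduces to a dot product.
        return F.normalize(self.backbone(x), dim=-1)


@torch.no_grad()
def open_vocab_classify(tactile_frame: torch.Tensor,
                        text_embeddings: torch.Tensor,
                        encoder: TactileEncoder) -> int:
    """Return the index of the text prompt closest to the tactile embedding."""
    z = encoder(tactile_frame.unsqueeze(0))       # (1, D)
    t = F.normalize(text_embeddings, dim=-1)      # (C, D)
    similarities = z @ t.T                        # cosine similarities, (1, C)
    return int(similarities.argmax(dim=-1))


if __name__ == "__main__":
    encoder = TactileEncoder()
    tactile_frame = torch.rand(3, 224, 224)   # placeholder for a tactile sensor image
    # In practice these would be CLIP text embeddings of prompts such as
    # "this surface feels smooth" or "this surface feels rough".
    prompt_embeddings = torch.randn(4, 512)
    print("predicted prompt index:", open_vocab_classify(tactile_frame, prompt_embeddings, encoder))
```

Because the classes are just text prompts, new tactile categories can be scored at inference time without retraining the encoder; the accuracy gains reported in the abstract come from the paper's own encoder and training procedure, not from this toy model.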

Community

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any paper on Hugging Face, check out this Space

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend


Models citing this paper 1

Datasets citing this paper 1

Spaces citing this paper 0

No Spaces currently link this paper

Cite arxiv.org/abs/2402.13232 in a Space README.md to link it from this page.

Collections including this paper 11