A Touch, Vision, and Language Dataset for Multimodal Alignment
\n","updatedAt":"2024-02-22T01:22:21.609Z","author":{"_id":"63d3e0e8ff1384ce6c5dd17d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg","fullname":"Librarian Bot (Bot)","name":"librarian-bot","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":318,"isUserFollowing":false}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.727710485458374},"editors":["librarian-bot"],"editorAvatarUrls":["https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg"],"reactions":[],"isReport":false}}],"primaryEmailConfirmed":false,"paper":{"id":"2402.13232","authors":[{"_id":"65d561795717d720a0c544fc","user":{"_id":"64908d6b0c18343a09361b93","avatarUrl":"/avatars/2fa44344fe760f68f19de775b76f45b8.svg","isPro":false,"fullname":"Max (Letian) Fu","user":"mlfu7","type":"user"},"name":"Letian Fu","status":"extracted_pending","statusLastChangedAt":"2024-02-21T02:35:38.909Z","hidden":false},{"_id":"65d561795717d720a0c544fd","name":"Gaurav Datta","hidden":false},{"_id":"65d561795717d720a0c544fe","name":"Huang Huang","hidden":false},{"_id":"65d561795717d720a0c544ff","name":"William Chung-Ho Panitch","hidden":false},{"_id":"65d561795717d720a0c54500","name":"Jaimyn Drake","hidden":false},{"_id":"65d561795717d720a0c54501","name":"Joseph Ortiz","hidden":false},{"_id":"65d561795717d720a0c54502","name":"Mustafa Mukadam","hidden":false},{"_id":"65d561795717d720a0c54503","name":"Mike Lambeta","hidden":false},{"_id":"65d561795717d720a0c54504","name":"Roberto Calandra","hidden":false},{"_id":"65d561795717d720a0c54505","name":"Ken Goldberg","hidden":false}],"publishedAt":"2024-02-20T18:47:56.000Z","submittedOnDailyAt":"2024-02-21T00:55:47.377Z","title":"A Touch, Vision, and Language Dataset for Multimodal Alignment","submittedOnDailyBy":{"_id":"60f1abe7544c2adfd699860c","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1674929746905-60f1abe7544c2adfd699860c.jpeg","isPro":false,"fullname":"AK","user":"akhaliq","type":"user"},"summary":"Touch is an important sensing modality for humans, but it has not yet been\nincorporated into a multimodal generative language model. This is partially due\nto the difficulty of obtaining natural language labels for tactile data and the\ncomplexity of aligning tactile readings with both visual observations and\nlanguage descriptions. As a step towards bridging that gap, this work\nintroduces a new dataset of 44K in-the-wild vision-touch pairs, with English\nlanguage labels annotated by humans (10%) and textual pseudo-labels from GPT-4V\n(90%). We use this dataset to train a vision-language-aligned tactile encoder\nfor open-vocabulary classification and a touch-vision-language (TVL) model for\ntext generation using the trained encoder. Results suggest that by\nincorporating touch, the TVL model improves (+29% classification accuracy)\ntouch-vision-language alignment over existing models trained on any pair of\nthose modalities. Although only a small fraction of the dataset is\nhuman-labeled, the TVL model demonstrates improved visual-tactile understanding\nover GPT-4V (+12%) and open-source vision-language models (+32%) on a new\ntouch-vision understanding benchmark. 
Code and data:\nhttps://tactile-vlm.github.io.","upvotes":16,"discussionId":"65d5617a5717d720a0c5457d","githubRepo":"https://github.com/Max-Fu/tvl","githubRepoAddedBy":"auto","ai_summary":"The study presents a new dataset and multimodal model that integrates touch, vision, and language, showing improvements in touch-vision-language alignment and visual-tactile understanding.","ai_keywords":["multimodal generative language model","tactile data","vision-touch pairs","English language labels","GPT-4V","textual pseudo-labels","vision-language-aligned tactile encoder","touch-vision-language model","text generation","open-vocabulary classification","touch-vision understanding benchmark"],"githubStars":94},"canReadDatabase":false,"canManagePapers":false,"canSubmit":false,"hasHfLevelAccess":false,"upvoted":false,"upvoters":[{"_id":"64908d6b0c18343a09361b93","avatarUrl":"/avatars/2fa44344fe760f68f19de775b76f45b8.svg","isPro":false,"fullname":"Max (Letian) Fu","user":"mlfu7","type":"user"},{"_id":"6039478ab3ecf716b1a5fd4d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/6039478ab3ecf716b1a5fd4d/_Thy4E7taiSYBLKxEKJbT.jpeg","isPro":true,"fullname":"taesiri","user":"taesiri","type":"user"},{"_id":"620783f24e28382272337ba4","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/620783f24e28382272337ba4/zkUveQPNiDfYjgGhuFErj.jpeg","isPro":false,"fullname":"GuoLiangTang","user":"Tommy930","type":"user"},{"_id":"6538119803519fddb4a17e10","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/6538119803519fddb4a17e10/ffJMkdx-rM7VvLTCM6ri_.jpeg","isPro":false,"fullname":"samusenps","user":"samusenps","type":"user"},{"_id":"62a4ac6fd83c3facafa50892","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/62a4ac6fd83c3facafa50892/qFpobw9B5XaLZvwn0XbmB.jpeg","isPro":false,"fullname":"Mohammed Brıman","user":"mohammedbriman","type":"user"},{"_id":"655ac762cb17ec19ef82719b","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/655ac762cb17ec19ef82719b/1kDncYrGLYS_2SR8cNdAL.png","isPro":false,"fullname":"Welcome to matlok","user":"matlok","type":"user"},{"_id":"635964636a61954080850e1d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/635964636a61954080850e1d/0bfExuDTrHTtm8c-40cDM.png","isPro":false,"fullname":"William Lamkin","user":"phanes","type":"user"},{"_id":"63b6f2e752c02ae8acbaa4d8","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1672934038280-noauth.jpeg","isPro":false,"fullname":"Habibullah Akbar","user":"ChavyvAkvar","type":"user"},{"_id":"6527e89a8808d80ccff88b7a","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/6527e89a8808d80ccff88b7a/CuGNmF1Et8KMQ0mCd1NEJ.jpeg","isPro":true,"fullname":"Not Lain","user":"not-lain","type":"user"},{"_id":"641afd4daebaa27e074f3da8","avatarUrl":"/avatars/4ead5c0d86443b83ef06683775ad5ee4.svg","isPro":false,"fullname":"Arnie Chen","user":"Arnie9203","type":"user"},{"_id":"6093a02dc4a92d63a91c5236","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/6093a02dc4a92d63a91c5236/yUte6V0FU0BvVFAbON-9n.jpeg","isPro":true,"fullname":"Diwank Tomer","user":"diwank","type":"user"},{"_id":"6355b6260d6e89270d48e234","avatarUrl":"/avatars/dd6d91febbd2812e5b4bb8ae7a8bdab2.svg","isPro":false,"fullname":"Garin K","user":"gnbk","type":"user"}],"acceptLanguages":["*"],"dailyPaperRank":0}">
AI-generated summary

The study presents a new dataset and multimodal model that integrates touch, vision, and language, showing improvements in touch-vision-language alignment and visual-tactile understanding.

Abstract
Touch is an important sensing modality for humans, but it has not yet been
incorporated into a multimodal generative language model. This is partially due
to the difficulty of obtaining natural language labels for tactile data and the
complexity of aligning tactile readings with both visual observations and
language descriptions. As a step towards bridging that gap, this work
introduces a new dataset of 44K in-the-wild vision-touch pairs, with English
language labels annotated by humans (10%) and textual pseudo-labels from GPT-4V
(90%). We use this dataset to train a vision-language-aligned tactile encoder
for open-vocabulary classification and a touch-vision-language (TVL) model for
text generation using the trained encoder. Results suggest that by
incorporating touch, the TVL model improves (+29% classification accuracy)
touch-vision-language alignment over existing models trained on any pair of
those modalities. Although only a small fraction of the dataset is
human-labeled, the TVL model demonstrates improved visual-tactile understanding
over GPT-4V (+12%) and open-source vision-language models (+32%) on a new
touch-vision understanding benchmark. Code and data:
https://tactile-vlm.github.io.
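
The abstract describes a tactile encoder aligned to a shared vision-language embedding space and used for open-vocabulary classification. The sketch below illustrates the general idea only, under stated assumptions: candidate labels would be embedded by a frozen text encoder (represented here by a stand-in tensor), and a tactile reading is scored against them by cosine similarity. `TactileEncoder`, its architecture, and the example inputs are hypothetical placeholders, not the TVL model's actual implementation (see https://github.com/Max-Fu/tvl for that).

```python
# Minimal sketch, not the paper's implementation: open-vocabulary tactile
# classification with an encoder aligned to a shared vision-language embedding
# space. All names, shapes, and inputs below are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TactileEncoder(nn.Module):
    """Toy encoder mapping a 3-channel tactile image into the shared space."""

    def __init__(self, embed_dim: int = 512):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.proj = nn.Linear(64, embed_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # L2-normalize so cosine similarity reduces to a dot product.
        return F.normalize(self.proj(self.backbone(x)), dim=-1)


@torch.no_grad()
def open_vocab_classify(encoder, tactile_image, text_embeds, labels):
    """Score one tactile reading against arbitrary text labels.

    `text_embeds` (K, D) would come from a frozen vision-language text
    encoder; here it is simply a tensor supplied by the caller.
    """
    z = encoder(tactile_image.unsqueeze(0))          # (1, D), unit norm
    text_embeds = F.normalize(text_embeds, dim=-1)   # (K, D), unit norm
    probs = (z @ text_embeds.T).softmax(dim=-1).squeeze(0)
    return dict(zip(labels, probs.tolist()))


if __name__ == "__main__":
    labels = ["smooth glass", "rough fabric", "soft foam"]
    encoder = TactileEncoder()
    fake_touch = torch.randn(3, 224, 224)            # stand-in tactile image
    fake_text = torch.randn(len(labels), 512)        # stand-in text embeddings
    print(open_vocab_classify(encoder, fake_touch, fake_text, labels))
```

With randomly initialized weights and stand-in embeddings the scores are meaningless; the point is only the inference pattern, where any new label can be classified simply by adding its text embedding to `text_embeds`, which is what makes the classification "open-vocabulary".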