Paper page - Temporal Reasoning Transfer from Text to Video

https://video-t3.github.io/

\n","updatedAt":"2024-10-10T02:15:02.164Z","author":{"_id":"6038d6d0612f5eef3cc05ea9","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/6038d6d0612f5eef3cc05ea9/ryhvAX5djQpD5OrIlZQ1f.jpeg","fullname":"Lei Li","name":"tobiaslee","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":24,"isUserFollowing":false}},"numEdits":0,"editors":["tobiaslee"],"editorAvatarUrls":["https://cdn-avatars.huggingface.co/v1/production/uploads/6038d6d0612f5eef3cc05ea9/ryhvAX5djQpD5OrIlZQ1f.jpeg"],"reactions":[],"isReport":false}},{"id":"6707390895ce5fc7ea9db15d","author":{"_id":"6038d6d0612f5eef3cc05ea9","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/6038d6d0612f5eef3cc05ea9/ryhvAX5djQpD5OrIlZQ1f.jpeg","fullname":"Lei Li","name":"tobiaslee","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":24,"isUserFollowing":false},"createdAt":"2024-10-10T02:16:40.000Z","type":"comment","data":{"edited":true,"hidden":true,"hiddenBy":"","latest":{"raw":"This comment has been hidden","html":"This comment has been hidden","updatedAt":"2024-10-10T02:17:42.674Z","author":{"_id":"6038d6d0612f5eef3cc05ea9","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/6038d6d0612f5eef3cc05ea9/ryhvAX5djQpD5OrIlZQ1f.jpeg","fullname":"Lei Li","name":"tobiaslee","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":24,"isUserFollowing":false}},"numEdits":0,"editors":[],"editorAvatarUrls":[],"reactions":[]}},{"id":"670880b3761c066c92e2a994","author":{"_id":"63d3e0e8ff1384ce6c5dd17d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg","fullname":"Librarian Bot (Bot)","name":"librarian-bot","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":318,"isUserFollowing":false},"createdAt":"2024-10-11T01:34:43.000Z","type":"comment","data":{"edited":false,"hidden":false,"latest":{"raw":"This is an automated message from the [Librarian Bot](https://huggingface.co/librarian-bots). I found the following papers similar to this paper. \n\nThe following papers were recommended by the Semantic Scholar API \n\n* [Grounded-VideoLLM: Sharpening Fine-grained Temporal Grounding in Video Large Language Models](https://huggingface.co/papers/2410.03290) (2024)\n* [Video Instruction Tuning With Synthetic Data](https://huggingface.co/papers/2410.02713) (2024)\n* [From Seconds to Hours: Reviewing MultiModal Large Language Models on Comprehensive Long Video Understanding](https://huggingface.co/papers/2409.18938) (2024)\n* [Question-Answering Dense Video Events](https://huggingface.co/papers/2409.04388) (2024)\n* [AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark](https://huggingface.co/papers/2410.03051) (2024)\n\n\n Please give a thumbs up to this comment if you found it helpful!\n\n If you want recommendations for any Paper on Hugging Face checkout [this](https://huggingface.co/spaces/librarian-bots/recommend_similar_papers) Space\n\n You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: `@librarian-bot recommend`","html":"

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

\n

The following papers were recommended by the Semantic Scholar API

\n\n

Please give a thumbs up to this comment if you found it helpful!

\n

If you want recommendations for any Paper on Hugging Face checkout this Space

\n

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: \n\n@librarian-bot\n\t recommend

\n","updatedAt":"2024-10-11T01:34:43.548Z","author":{"_id":"63d3e0e8ff1384ce6c5dd17d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg","fullname":"Librarian Bot (Bot)","name":"librarian-bot","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":318,"isUserFollowing":false}},"numEdits":0,"editors":["librarian-bot"],"editorAvatarUrls":["https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg"],"reactions":[],"isReport":false}},{"id":"6708860fba9fbd20ec41112f","author":{"_id":"630482849aef62c4013c1176","avatarUrl":"/avatars/739bab6b5755ca87645a819dfcf366b0.svg","fullname":"Acdc","name":"Acdcblackz","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"isUserFollowing":false},"createdAt":"2024-10-11T01:57:35.000Z","type":"comment","data":{"edited":false,"hidden":false,"latest":{"raw":"What does temporal even mean in this context","html":"

What does temporal even mean in this context

\n","updatedAt":"2024-10-11T01:57:35.181Z","author":{"_id":"630482849aef62c4013c1176","avatarUrl":"/avatars/739bab6b5755ca87645a819dfcf366b0.svg","fullname":"Acdc","name":"Acdcblackz","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"isUserFollowing":false}},"numEdits":0,"editors":["Acdcblackz"],"editorAvatarUrls":["/avatars/739bab6b5755ca87645a819dfcf366b0.svg"],"reactions":[],"isReport":false}}],"primaryEmailConfirmed":false,"paper":{"id":"2410.06166","authors":[{"_id":"6707388ec818ba2f6579eb78","user":{"_id":"6038d6d0612f5eef3cc05ea9","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/6038d6d0612f5eef3cc05ea9/ryhvAX5djQpD5OrIlZQ1f.jpeg","isPro":false,"fullname":"Lei Li","user":"tobiaslee","type":"user"},"name":"Lei Li","status":"admin_assigned","statusLastChangedAt":"2024-10-10T09:11:14.571Z","hidden":false},{"_id":"6707388ec818ba2f6579eb79","user":{"_id":"6489761dcaea79f577897f98","avatarUrl":"/avatars/8f56dc9c08dc2b672555602d68509a03.svg","isPro":false,"fullname":"Yuanxin Liu","user":"lyx97","type":"user"},"name":"Yuanxin Liu","status":"claimed_verified","statusLastChangedAt":"2024-10-10T08:01:34.232Z","hidden":false},{"_id":"6707388ec818ba2f6579eb7a","user":{"_id":"655ca347f426a304c6b393a1","avatarUrl":"/avatars/67f0310d59c5912d38c2ad8e6448614d.svg","isPro":false,"fullname":"Linli Yao","user":"yaolily","type":"user"},"name":"Linli Yao","status":"admin_assigned","statusLastChangedAt":"2024-10-10T09:11:21.755Z","hidden":false},{"_id":"6707388ec818ba2f6579eb7b","user":{"_id":"63565cc56d7fcf1bedb7d347","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/63565cc56d7fcf1bedb7d347/XGcHP4VkO_oieA1gZ4IAX.jpeg","isPro":false,"fullname":"Zhang Peiyuan","user":"PY007","type":"user"},"name":"Peiyuan Zhang","status":"admin_assigned","statusLastChangedAt":"2024-10-10T09:11:32.678Z","hidden":false},{"_id":"6707388ec818ba2f6579eb7c","user":{"_id":"64acb321264bbbf171a2b040","avatarUrl":"/avatars/0ad344c0e9b1e3fda469932f91d117dc.svg","isPro":false,"fullname":"Chenxin An","user":"Chancy","type":"user"},"name":"Chenxin An","status":"admin_assigned","statusLastChangedAt":"2024-10-10T09:11:37.979Z","hidden":false},{"_id":"6707388ec818ba2f6579eb7d","user":{"_id":"650c509472afb1e60e6151ae","avatarUrl":"/avatars/c16ab5053a586819dc2b965303215ff7.svg","isPro":false,"fullname":"Lean Wang","user":"AdaHousman","type":"user"},"name":"Lean Wang","status":"admin_assigned","statusLastChangedAt":"2024-10-10T09:11:43.776Z","hidden":false},{"_id":"6707388ec818ba2f6579eb7e","name":"Xu Sun","hidden":false},{"_id":"6707388ec818ba2f6579eb7f","name":"Lingpeng Kong","hidden":false},{"_id":"6707388ec818ba2f6579eb80","user":{"_id":"63ad3de96ee60ca58a409280","avatarUrl":"/avatars/7461f4fda3692f042e556d2a7c339bc0.svg","isPro":false,"fullname":"Qi Liu","user":"QiLiuHKU","type":"user"},"name":"Qi Liu","status":"admin_assigned","statusLastChangedAt":"2024-10-10T09:12:34.540Z","hidden":false}],"publishedAt":"2024-10-08T16:10:29.000Z","submittedOnDailyAt":"2024-10-10T00:45:02.114Z","title":"Temporal Reasoning Transfer from Text to Video","submittedOnDailyBy":{"_id":"6038d6d0612f5eef3cc05ea9","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/6038d6d0612f5eef3cc05ea9/ryhvAX5djQpD5OrIlZQ1f.jpeg","isPro":false,"fullname":"Lei Li","user":"tobiaslee","type":"user"},"summary":"Video Large Language Models (Video LLMs) have shown promising capabilities in\nvideo comprehension, yet they struggle with tracking temporal changes and\nreasoning about temporal 
relationships. While previous research attributed this\nlimitation to the ineffective temporal encoding of visual inputs, our\ndiagnostic study reveals that video representations contain sufficient\ninformation for even small probing classifiers to achieve perfect accuracy.\nSurprisingly, we find that the key bottleneck in Video LLMs' temporal reasoning\ncapability stems from the underlying LLM's inherent difficulty with temporal\nconcepts, as evidenced by poor performance on textual temporal\nquestion-answering tasks. Building on this discovery, we introduce the Textual\nTemporal reasoning Transfer (T3). T3 synthesizes diverse temporal reasoning\ntasks in pure text format from existing image-text datasets, addressing the\nscarcity of video samples with complex temporal scenarios. Remarkably, without\nusing any video data, T3 enhances LongVA-7B's temporal understanding, yielding\na 5.3 absolute accuracy improvement on the challenging TempCompass benchmark,\nwhich enables our model to outperform ShareGPT4Video-8B trained on 28,000 video\nsamples. Additionally, the enhanced LongVA-7B model achieves competitive\nperformance on comprehensive video benchmarks. For example, it achieves a 49.7\naccuracy on the Temporal Reasoning task of Video-MME, surpassing powerful\nlarge-scale models such as InternVL-Chat-V1.5-20B and VILA1.5-40B. Further\nanalysis reveals a strong correlation between textual and video temporal task\nperformance, validating the efficacy of transferring temporal reasoning\nabilities from text to video domains.","upvotes":13,"discussionId":"6707388fc818ba2f6579ebd1","projectPage":"https://video-t3.github.io/","githubRepo":"https://github.com/llyx97/video-t3","githubRepoAddedBy":"user","ai_summary":"Textual Temporal reasoning Transfer (T3) enhances Video LLMs' temporal understanding by leveraging text-based temporal tasks, leading to superior performance on video benchmarks without additional video data.","ai_keywords":["Video LLMs","temporal changes","temporal relationships","temporal encoding","temporal reasoning","temporal question-answering","Textual Temporal reasoning Transfer (T3)","TempCompass benchmark","LongVA-7B","ShareGPT4Video-8B","Video-MME","InternVL-Chat-V1.5-20B","VILA1.5-40B"],"githubStars":8},"canReadDatabase":false,"canManagePapers":false,"canSubmit":false,"hasHfLevelAccess":false,"upvoted":false,"upvoters":[{"_id":"6038d6d0612f5eef3cc05ea9","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/6038d6d0612f5eef3cc05ea9/ryhvAX5djQpD5OrIlZQ1f.jpeg","isPro":false,"fullname":"Lei Li","user":"tobiaslee","type":"user"},{"_id":"6489761dcaea79f577897f98","avatarUrl":"/avatars/8f56dc9c08dc2b672555602d68509a03.svg","isPro":false,"fullname":"Yuanxin 
Liu","user":"lyx97","type":"user"},{"_id":"63ddc7b80f6d2d6c3efe3600","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/63ddc7b80f6d2d6c3efe3600/RX5q9T80Jl3tn6z03ls0l.jpeg","isPro":false,"fullname":"J","user":"dashfunnydashdash","type":"user"},{"_id":"61f0f66a7855b96a04b223dd","avatarUrl":"/avatars/d17e4a4b467ef9019594036ed8f1ca6e.svg","isPro":false,"fullname":"W","user":"Windy","type":"user"},{"_id":"63f45b8d520c14618930d175","avatarUrl":"/avatars/42b3aaf50748a25e4a596fc57ab1306d.svg","isPro":false,"fullname":"renjie","user":"renjiepi","type":"user"},{"_id":"620783f24e28382272337ba4","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/620783f24e28382272337ba4/zkUveQPNiDfYjgGhuFErj.jpeg","isPro":false,"fullname":"GuoLiangTang","user":"Tommy930","type":"user"},{"_id":"5f32b2367e583543386214d9","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1635314457124-5f32b2367e583543386214d9.jpeg","isPro":false,"fullname":"Sergei Averkiev","user":"averoo","type":"user"},{"_id":"63b6f2e752c02ae8acbaa4d8","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1672934038280-noauth.jpeg","isPro":false,"fullname":"Habibullah Akbar","user":"ChavyvAkvar","type":"user"},{"_id":"6707c949ceaa2578b5645450","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/no-auth/zPlCru7Y9bQBCTYcAJ6Ah.png","isPro":false,"fullname":"Jeff badman","user":"Lastat22","type":"user"},{"_id":"6270324ebecab9e2dcf245de","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/6270324ebecab9e2dcf245de/cMbtWSasyNlYc9hvsEEzt.jpeg","isPro":false,"fullname":"Kye Gomez","user":"kye","type":"user"},{"_id":"648eb1eb59c4e5c87dc116e0","avatarUrl":"/avatars/c636cea39c2c0937f01398c94ead5dad.svg","isPro":false,"fullname":"fdsqefsgergd","user":"T-representer","type":"user"},{"_id":"64f955c582673b2a07fbf0ad","avatarUrl":"/avatars/1c98c8be61f6580c1e4ee698fa5c0716.svg","isPro":false,"fullname":"hongyu","user":"learn12138","type":"user"}],"acceptLanguages":["*"],"dailyPaperRank":0}">
arxiv:2410.06166

Temporal Reasoning Transfer from Text to Video

Published on Oct 8, 2024
· Submitted by Lei Li on Oct 10, 2024
Authors: Lei Li, Yuanxin Liu, Linli Yao, Peiyuan Zhang, Chenxin An, Lean Wang, Xu Sun, Lingpeng Kong, Qi Liu

Abstract

AI-generated summary

Textual Temporal reasoning Transfer (T3) enhances Video LLMs' temporal understanding by leveraging text-based temporal tasks, leading to superior performance on video benchmarks without additional video data.

Video Large Language Models (Video LLMs) have shown promising capabilities in video comprehension, yet they struggle with tracking temporal changes and reasoning about temporal relationships. While previous research attributed this limitation to the ineffective temporal encoding of visual inputs, our diagnostic study reveals that video representations contain sufficient information for even small probing classifiers to achieve perfect accuracy. Surprisingly, we find that the key bottleneck in Video LLMs' temporal reasoning capability stems from the underlying LLM's inherent difficulty with temporal concepts, as evidenced by poor performance on textual temporal question-answering tasks. Building on this discovery, we introduce the Textual Temporal reasoning Transfer (T3). T3 synthesizes diverse temporal reasoning tasks in pure text format from existing image-text datasets, addressing the scarcity of video samples with complex temporal scenarios. Remarkably, without using any video data, T3 enhances LongVA-7B's temporal understanding, yielding a 5.3 absolute accuracy improvement on the challenging TempCompass benchmark, which enables our model to outperform ShareGPT4Video-8B trained on 28,000 video samples. Additionally, the enhanced LongVA-7B model achieves competitive performance on comprehensive video benchmarks. For example, it achieves a 49.7 accuracy on the Temporal Reasoning task of Video-MME, surpassing powerful large-scale models such as InternVL-Chat-V1.5-20B and VILA1.5-40B. Further analysis reveals a strong correlation between textual and video temporal task performance, validating the efficacy of transferring temporal reasoning abilities from text to video domains.
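The central idea of T3, as described above, is to turn static image-text data into temporal reasoning supervision expressed purely in text, for example by asking a model to recover the order of shuffled event descriptions. The snippet below is a minimal, hypothetical sketch of one such text-only ordering task; the captions, prompt template, and answer format are assumptions for illustration and are not taken from the authors' released pipeline.

```python
import random

# Hypothetical sketch (not the authors' released pipeline): given captions of
# individual frames/images, build a pure-text temporal-ordering QA sample of the
# kind the abstract describes T3 synthesizing. Captions and the prompt wording
# below are invented for illustration.
captions = [
    "A man fills a kettle with water.",
    "He pours the boiling water into a cup.",
    "He stirs the tea with a spoon.",
]

def make_order_question(captions, rng=random):
    """Shuffle captioned events and ask for their original temporal order."""
    order = list(range(len(captions)))
    rng.shuffle(order)
    options = [f"({chr(65 + k)}) {captions[i]}" for k, i in enumerate(order)]
    question = (
        "The following events are listed out of order:\n"
        + "\n".join(options)
        + "\nIn what order did they actually happen? Answer with the option letters."
    )
    # Original event i ended up at shuffled position order.index(i), i.e. that letter.
    answer = ", ".join(chr(65 + order.index(i)) for i in range(len(captions)))
    return {"question": question, "answer": answer}

print(make_order_question(captions))
```

Generating many such samples requires no video frames at all, which is the property the abstract highlights when reporting gains on TempCompass and Video-MME without any additional video training data.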

Community

Paper author Paper submitter
https://video-t3.github.io/

Paper author Paper submitter
This comment has been hidden

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API:

* Grounded-VideoLLM: Sharpening Fine-grained Temporal Grounding in Video Large Language Models (2024) https://huggingface.co/papers/2410.03290
* Video Instruction Tuning With Synthetic Data (2024) https://huggingface.co/papers/2410.02713
* From Seconds to Hours: Reviewing MultiModal Large Language Models on Comprehensive Long Video Understanding (2024) https://huggingface.co/papers/2409.18938
* Question-Answering Dense Video Events (2024) https://huggingface.co/papers/2409.04388
* AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark (2024) https://huggingface.co/papers/2410.03051

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any paper on Hugging Face, check out this Space: https://huggingface.co/spaces/librarian-bots/recommend_similar_papers

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend

What does temporal even mean in this context?


Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2410.06166 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2410.06166 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2410.06166 in a Space README.md to link it from this page.

Collections including this paper 1