PiTe: Pixel-Temporal Alignment for Large Video-Language Model
\n","updatedAt":"2024-09-14T01:33:03.415Z","author":{"_id":"63d3e0e8ff1384ce6c5dd17d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg","fullname":"Librarian Bot (Bot)","name":"librarian-bot","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":318,"isUserFollowing":false}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.715046763420105},"editors":["librarian-bot"],"editorAvatarUrls":["https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg"],"reactions":[],"isReport":false}}],"primaryEmailConfirmed":false,"paper":{"id":"2409.07239","authors":[{"_id":"66e40348038300b07a7dda0f","user":{"_id":"6586817e509bcae23f3dfc60","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/noauth/pwEStoey1XJYDi3F0Z930.png","isPro":false,"fullname":"Yang Liu","user":"yliu-cs","type":"user"},"name":"Yang Liu","status":"claimed_verified","statusLastChangedAt":"2025-05-21T09:10:53.960Z","hidden":false},{"_id":"66e40348038300b07a7dda10","name":"Pengxiang Ding","hidden":false},{"_id":"66e40348038300b07a7dda11","user":{"_id":"65fd82762bf2cd20ddaa193f","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/noauth/yBYbWp_mT7UusYdkqtAvw.png","isPro":false,"fullname":"Siteng Huang","user":"huangsiteng","type":"user"},"name":"Siteng Huang","status":"claimed_verified","statusLastChangedAt":"2024-09-13T12:09:08.813Z","hidden":false},{"_id":"66e40348038300b07a7dda12","name":"Min Zhang","hidden":false},{"_id":"66e40348038300b07a7dda13","user":{"_id":"646dbbc8075bbcc48ddcecbf","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/646dbbc8075bbcc48ddcecbf/V52Em-78O5F3QxRbRwG5O.jpeg","isPro":false,"fullname":"Han Zhao","user":"han1997","type":"user"},"name":"Han Zhao","status":"claimed_verified","statusLastChangedAt":"2025-10-18T04:52:34.823Z","hidden":false},{"_id":"66e40348038300b07a7dda14","name":"Donglin Wang","hidden":false}],"publishedAt":"2024-09-11T12:53:07.000Z","submittedOnDailyAt":"2024-09-13T07:50:31.822Z","title":"PiTe: Pixel-Temporal Alignment for Large Video-Language Model","submittedOnDailyBy":{"_id":"65fd82762bf2cd20ddaa193f","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/noauth/yBYbWp_mT7UusYdkqtAvw.png","isPro":false,"fullname":"Siteng Huang","user":"huangsiteng","type":"user"},"summary":"Fueled by the Large Language Models (LLMs) wave, Large Visual-Language Models\n(LVLMs) have emerged as a pivotal advancement, bridging the gap between image\nand text. However, video making it challenging for LVLMs to perform adequately\ndue to the complexity of the relationship between language and spatial-temporal\ndata structure. Recent Large Video-Language Models (LVidLMs) align feature of\nstatic visual data like image into latent space of language feature, by general\nmulti-modal tasks to leverage abilities of LLMs sufficiently. In this paper, we\nexplore fine-grained alignment approach via object trajectory for different\nmodalities across both spatial and temporal dimensions simultaneously. Thus, we\npropose a novel LVidLM by trajectory-guided Pixel-Temporal Alignment, dubbed\nPiTe, that exhibits promising applicable model property. 
To achieve\nfine-grained video-language alignment, we curate a multi-modal pre-training\ndataset PiTe-143k, the dataset provision of moving trajectories in pixel level\nfor all individual objects, that appear and mention in the video and caption\nboth, by our automatic annotation pipeline. Meanwhile, PiTe demonstrates\nastounding capabilities on myriad video-related multi-modal tasks through beat\nthe state-of-the-art methods by a large margin.","upvotes":15,"discussionId":"66e4034a038300b07a7ddaac","githubRepo":"https://github.com/yliu-cs/pite","githubRepoAddedBy":"auto","ai_summary":"A new video-language model, PiTe, uses trajectory-guided pixel-temporal alignment to achieve superior performance across various multimodal video tasks by leveraging a large pre-training dataset with precise object trajectories.","ai_keywords":["Large Language Models (LLMs)","Large Visual-Language Models (LVLMs)","video","spatial-temporal data","Large Video-Language Models (LVidLMs)","latent space","multi-modal tasks","fine-grained alignment","object trajectory","trajectory-guided Pixel-Temporal Alignment","PiTe","pre-training dataset","PiTe-143k","automatic annotation pipeline"],"githubStars":17},"canReadDatabase":false,"canManagePapers":false,"canSubmit":false,"hasHfLevelAccess":false,"upvoted":false,"upvoters":[{"_id":"65fd82762bf2cd20ddaa193f","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/noauth/yBYbWp_mT7UusYdkqtAvw.png","isPro":false,"fullname":"Siteng Huang","user":"huangsiteng","type":"user"},{"_id":"653288809326d6da5ff12f5a","avatarUrl":"/avatars/388bbbd3ce472da1e37b882141d47323.svg","isPro":false,"fullname":"Lupi","user":"Chodevil","type":"user"},{"_id":"6586817e509bcae23f3dfc60","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/noauth/pwEStoey1XJYDi3F0Z930.png","isPro":false,"fullname":"Yang Liu","user":"yliu-cs","type":"user"},{"_id":"648eb1eb59c4e5c87dc116e0","avatarUrl":"/avatars/c636cea39c2c0937f01398c94ead5dad.svg","isPro":false,"fullname":"fdsqefsgergd","user":"T-representer","type":"user"},{"_id":"620783f24e28382272337ba4","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/620783f24e28382272337ba4/zkUveQPNiDfYjgGhuFErj.jpeg","isPro":false,"fullname":"GuoLiangTang","user":"Tommy930","type":"user"},{"_id":"6538119803519fddb4a17e10","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/6538119803519fddb4a17e10/ffJMkdx-rM7VvLTCM6ri_.jpeg","isPro":false,"fullname":"samusenps","user":"samusenps","type":"user"},{"_id":"636f533c1ca0ea5107ed171d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/636f533c1ca0ea5107ed171d/jLwsrcPtUiHj8WhcE0Y67.jpeg","isPro":false,"fullname":"Bhimraj Yadav","user":"bhimrazy","type":"user"},{"_id":"6270324ebecab9e2dcf245de","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/6270324ebecab9e2dcf245de/cMbtWSasyNlYc9hvsEEzt.jpeg","isPro":false,"fullname":"Kye 
Gomez","user":"kye","type":"user"},{"_id":"609653c1146ef3bfe2fc7392","avatarUrl":"/avatars/1639b6552a419209ae67b6562183bc2f.svg","isPro":false,"fullname":"Inui","user":"Norm","type":"user"},{"_id":"64380ae1819f3ab20d17431b","avatarUrl":"/avatars/a36b073c1c783102ddb455204fd816bd.svg","isPro":false,"fullname":"ZhenyuLiu","user":"foggyforest","type":"user"},{"_id":"62b049e653d878042fab5673","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/62b049e653d878042fab5673/lSgJdq2m-oLHHeeuMeSBV.jpeg","isPro":false,"fullname":"Ahmad","user":"AhmadHakami","type":"user"},{"_id":"663ccbff3a74a20189d4aa2e","avatarUrl":"/avatars/83a54455e0157480f65c498cd9057cf2.svg","isPro":false,"fullname":"Nguyen Van Thanh","user":"NguyenVanThanhHust","type":"user"}],"acceptLanguages":["*"],"dailyPaperRank":0}">
AI-generated summary

A new video-language model, PiTe, uses trajectory-guided pixel-temporal alignment to achieve superior performance across various multimodal video tasks by leveraging a large pre-training dataset with precise object trajectories.
Fueled by the wave of Large Language Models (LLMs), Large Visual-Language Models (LVLMs) have emerged as a pivotal advancement, bridging the gap between image and text. Video, however, remains challenging for LVLMs because of the complex relationship between language and spatial-temporal data structures. Recent Large Video-Language Models (LVidLMs) align features of static visual data such as images into the latent space of language features through general multi-modal tasks, so as to fully leverage the abilities of LLMs. In this paper, we explore a fine-grained alignment approach via object trajectories that aligns the different modalities across both spatial and temporal dimensions simultaneously. We thus propose a novel LVidLM with trajectory-guided Pixel-Temporal Alignment, dubbed PiTe, which exhibits promising, broadly applicable model properties. To achieve fine-grained video-language alignment, we curate a multi-modal pre-training dataset, PiTe-143k, which provides pixel-level moving trajectories, produced by our automatic annotation pipeline, for all individual objects that appear and are mentioned in both the video and the caption. PiTe demonstrates astounding capabilities on a myriad of video-related multi-modal tasks, beating state-of-the-art methods by a large margin.
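The core idea, as described above, is to supervise the model with pixel-level object trajectories so that visual and textual features become aligned over both space and time. The snippet below is a minimal, hypothetical PyTorch sketch of such a trajectory-guided alignment objective; the module names, tensor shapes, and L1 loss formulation are illustrative assumptions, not PiTe's actual implementation.

```python
# Hypothetical sketch of a trajectory-guided alignment objective.
# Shapes, modules, and the loss are illustrative assumptions; they do not
# reproduce PiTe's actual implementation.
import torch
import torch.nn as nn


class TrajectoryAlignmentHead(nn.Module):
    """Predicts per-frame (x, y) positions of referenced objects from
    fused video-language features (assumed layout)."""

    def __init__(self, hidden_dim: int = 768, max_objects: int = 8):
        super().__init__()
        self.proj = nn.Linear(hidden_dim, max_objects * 2)  # (x, y) per object

    def forward(self, fused_features: torch.Tensor) -> torch.Tensor:
        # fused_features: (batch, num_frames, hidden_dim)
        b, t, _ = fused_features.shape
        coords = self.proj(fused_features)       # (b, t, max_objects * 2)
        return coords.view(b, t, -1, 2)          # (b, t, max_objects, 2)


def trajectory_alignment_loss(pred: torch.Tensor,
                              target: torch.Tensor,
                              mask: torch.Tensor) -> torch.Tensor:
    """L1 regression against ground-truth pixel trajectories.

    pred, target: (batch, num_frames, max_objects, 2), coordinates
    normalized to [0, 1]; mask: (batch, num_frames, max_objects), 1 where
    an object is annotated in that frame."""
    err = (pred - target).abs().sum(dim=-1)      # (b, t, max_objects)
    return (err * mask).sum() / mask.sum().clamp(min=1)


if __name__ == "__main__":
    head = TrajectoryAlignmentHead()
    fused = torch.randn(2, 16, 768)              # dummy fused features
    target = torch.rand(2, 16, 8, 2)             # dummy trajectories
    mask = torch.ones(2, 16, 8)
    loss = trajectory_alignment_loss(head(fused), target, mask)
    print(loss.item())
```

In this sketch the trajectory head acts as an auxiliary supervision signal during pre-training, encouraging the fused representation to carry per-object spatial information at every time step.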
ECCV 2024 Oral. We present PiTe, a novel Large Video-Language Model (LVidLM) that achieves state-of-the-art performance in video understanding tasks through a trajectory-guided Pixel-Temporal Alignment approach. PiTe aligns visual and textual data across spatial and temporal dimensions by leveraging a curated multi-modal pre-training dataset, PiTe-143k, which provides moving trajectories at the pixel level for individual objects in videos. This approach enables PiTe to comprehend videos with greater detail and accuracy, outperforming existing methods in question-answering, temporal grounding, and dense captioning tasks.
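To make the dataset description above more concrete, here is a hypothetical, schematically simplified example of what a PiTe-143k-style annotation record could look like: a caption paired with per-object pixel trajectories for the objects mentioned in it. The field names and structure are assumptions for illustration and are not taken from the released dataset.

```python
# Hypothetical annotation record in the spirit of PiTe-143k.
# Field names and structure are illustrative assumptions, not the
# dataset's actual schema.
example_record = {
    "video_id": "example_0001",
    "caption": "A brown dog runs across the yard and picks up a ball.",
    "objects": [
        {
            "phrase": "a brown dog",         # span mentioned in the caption
            "trajectory": [                   # per-frame (x, y) pixel centers
                {"frame": 0, "x": 112, "y": 240},
                {"frame": 1, "x": 130, "y": 238},
                {"frame": 2, "x": 149, "y": 236},
            ],
        },
        {
            "phrase": "a ball",
            "trajectory": [
                {"frame": 0, "x": 412, "y": 300},
                {"frame": 1, "x": 412, "y": 300},
                {"frame": 2, "x": 410, "y": 298},
            ],
        },
    ],
}
```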