TRAVL: A Recipe for Making Video-Language Models Better Judges of Physics Implausibility
Project page: https://sam-motamed.github.io/projects/TRAVL
Authors: Saman Motamed, Minghao Chen, Luc Van Gool, Iro Laina
Published: 2025-10-08
AI-generated summary
TRAVL, a fine-tuning recipe built around a trajectory-aware attention module, improves the ability of Video-Language Models to judge physical plausibility, evaluated on the ImplausiBench benchmark.

Abstract
Despite impressive visual fidelity, modern video generative models frequently
produce sequences that violate intuitive physical laws, such as objects
floating, teleporting, or morphing in ways that defy causality. While humans
can easily detect such implausibilities, there remains no robust method for
quantitatively assessing physical realism in video. In this work, we explore
whether Video-Language Models (VLMs) can be trained to serve as reliable judges
of physical plausibility. We find that existing VLMs struggle to identify
physics violations, exposing fundamental limitations in their temporal and
causal reasoning. To address this, we introduce TRAVL, a fine-tuning recipe
that combines a balanced training dataset with a trajectory-aware attention
module to improve motion encoding and discrimination in VLMs. To evaluate
physical reasoning more rigorously, we propose ImplausiBench, a benchmark of
300 videos (150 real, 150 generated) that removes linguistic biases and
isolates visual-temporal understanding. Performance is reported both with
gold-standard human judgments and stricter LLM-as-judge metrics. Together,
TRAVL and ImplausiBench offer a unified framework for probing and improving
physical plausibility in multimodal models, shedding light on a challenging and
underexplored aspect of visual-temporal understanding.
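
The abstract describes a trajectory-aware attention module but this page gives no implementation details, so the sketch below is only an illustration of the general idea: biasing attention over per-frame object tokens with a signal derived from tracked object trajectories. The token layout, the displacement-based bias, and all hyperparameters are assumptions, not the authors' design.

```python
# Illustrative sketch only: the trajectory features, bias shape, and fusion
# point are assumptions; they are not taken from the TRAVL paper.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TrajectoryAwareAttention(nn.Module):
    """Self-attention over per-frame object tokens, biased by how far the
    tracked object moves between frames (a stand-in for 'trajectory awareness')."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        # Maps pairwise trajectory displacement to a per-head attention bias.
        self.traj_bias = nn.Sequential(nn.Linear(1, 32), nn.GELU(), nn.Linear(32, num_heads))

    def forward(self, tokens: torch.Tensor, positions: torch.Tensor) -> torch.Tensor:
        # tokens:    (B, T, D) one token per tracked object per frame
        # positions: (B, T, 2) normalized (x, y) centers of the tracked object
        B, T, D = tokens.shape
        q, k, v = self.qkv(tokens).chunk(3, dim=-1)
        q = q.view(B, T, self.num_heads, self.head_dim).transpose(1, 2)
        k = k.view(B, T, self.num_heads, self.head_dim).transpose(1, 2)
        v = v.view(B, T, self.num_heads, self.head_dim).transpose(1, 2)

        # Pairwise displacement magnitude between frames, turned into a bias
        # so attention can emphasize abrupt (potentially implausible) motion.
        disp = (positions.unsqueeze(2) - positions.unsqueeze(1)).norm(dim=-1, keepdim=True)  # (B, T, T, 1)
        bias = self.traj_bias(disp).permute(0, 3, 1, 2)  # (B, H, T, T)

        attn = (q @ k.transpose(-2, -1)) / self.head_dim**0.5 + bias
        out = F.softmax(attn, dim=-1) @ v
        out = out.transpose(1, 2).reshape(B, T, D)
        return self.proj(out)


if __name__ == "__main__":
    m = TrajectoryAwareAttention(dim=256)
    x = torch.randn(2, 16, 256)   # 2 clips, 16 frames of object tokens
    pos = torch.rand(2, 16, 2)    # tracked object centers per frame
    print(m(x, pos).shape)        # torch.Size([2, 16, 256])
```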
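
The abstract also reports accuracy against both gold-standard human judgments and a stricter LLM-as-judge metric on the 300-video ImplausiBench set. A minimal scoring loop might look like the following; the file layout, field names, and the exact pass/fail semantics of the LLM judge are hypothetical, not the paper's released evaluation code.

```python
# Minimal scoring sketch, not the paper's evaluation code. The JSON layout,
# field names, and the LLM-judge pass flag are assumptions for illustration.
import json
from pathlib import Path


def score(predictions_path: str) -> None:
    """Compare a VLM's plausible/implausible verdicts against two references:
    gold human labels and a stricter per-video LLM-as-judge pass flag."""
    records = json.loads(Path(predictions_path).read_text())

    human_correct = sum(r["model_verdict"] == r["human_label"] for r in records)
    # The stricter metric only credits a video if the verdict matches AND an
    # external LLM judged the model's explanation to identify the violation.
    llm_correct = sum(
        r["model_verdict"] == r["human_label"] and r["llm_judge_pass"] for r in records
    )

    n = len(records)  # ImplausiBench has 300 videos: 150 real, 150 generated
    print(f"Human-judged accuracy : {human_correct / n:.1%}")
    print(f"LLM-as-judge accuracy : {llm_correct / n:.1%}")


if __name__ == "__main__":
    score("implausibench_predictions.json")  # hypothetical output file
```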