Paper page - TRAVL: A Recipe for Making Video-Language Models Better Judges of Physics Implausibility

Project page: https://sam-motamed.github.io/projects/TRAVL

\n","updatedAt":"2025-10-10T00:36:45.311Z","author":{"_id":"6475c37b04c82116f9bb2356","avatarUrl":"/avatars/6ec34eb3cfd091a38454ac3de72aaddc.svg","fullname":"saman motamed","name":"sam-motamed","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"isUserFollowing":false}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.6772653460502625},"editors":["sam-motamed"],"editorAvatarUrls":["/avatars/6ec34eb3cfd091a38454ac3de72aaddc.svg"],"reactions":[],"isReport":false}},{"id":"68e9b5f2129711a5acdd6758","author":{"_id":"63d3e0e8ff1384ce6c5dd17d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg","fullname":"Librarian Bot (Bot)","name":"librarian-bot","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":318,"isUserFollowing":false},"createdAt":"2025-10-11T01:42:10.000Z","type":"comment","data":{"edited":false,"hidden":false,"latest":{"raw":"This is an automated message from the [Librarian Bot](https://huggingface.co/librarian-bots). I found the following papers similar to this paper. \n\nThe following papers were recommended by the Semantic Scholar API \n\n* [Bridging Vision Language Models and Symbolic Grounding for Video Question Answering](https://huggingface.co/papers/2509.11862) (2025)\n* [Enhancing Physical Plausibility in Video Generation by Reasoning the Implausibility](https://huggingface.co/papers/2509.24702) (2025)\n* [When and What: Diffusion-Grounded VideoLLM with Entity Aware Segmentation for Long Video Understanding](https://huggingface.co/papers/2508.15641) (2025)\n* [VidGuard-R1: AI-Generated Video Detection and Explanation via Reasoning MLLMs and RL](https://huggingface.co/papers/2510.02282) (2025)\n* [ViMoNet: A Multimodal Vision-Language Framework for Human Behavior Understanding from Motion and Video](https://huggingface.co/papers/2508.09818) (2025)\n* [Harnessing Synthetic Preference Data for Enhancing Temporal Understanding of Video-LLMs](https://huggingface.co/papers/2510.03955) (2025)\n* [Strefer: Empowering Video LLMs with Space-Time Referring and Reasoning via Synthetic Instruction Data](https://huggingface.co/papers/2509.03501) (2025)\n\n\n Please give a thumbs up to this comment if you found it helpful!\n\n If you want recommendations for any Paper on Hugging Face checkout [this](https://huggingface.co/spaces/librarian-bots/recommend_similar_papers) Space\n\n You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: `@librarian-bot recommend`","html":"

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

\n

The following papers were recommended by the Semantic Scholar API

\n\n

Please give a thumbs up to this comment if you found it helpful!

\n

If you want recommendations for any Paper on Hugging Face checkout this Space

\n

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: \n\n@librarian-bot\n\t recommend

\n","updatedAt":"2025-10-11T01:42:10.034Z","author":{"_id":"63d3e0e8ff1384ce6c5dd17d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg","fullname":"Librarian Bot (Bot)","name":"librarian-bot","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":318,"isUserFollowing":false}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.7245833873748779},"editors":["librarian-bot"],"editorAvatarUrls":["https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg"],"reactions":[],"isReport":false}}],"primaryEmailConfirmed":false,"paper":{"id":"2510.07550","authors":[{"_id":"68e854aa95e8e6771df38857","name":"Saman Motamed","hidden":false},{"_id":"68e854aa95e8e6771df38858","name":"Minghao Chen","hidden":false},{"_id":"68e854aa95e8e6771df38859","name":"Luc Van Gool","hidden":false},{"_id":"68e854aa95e8e6771df3885a","name":"Iro Laina","hidden":false}],"mediaUrls":["https://cdn-uploads.huggingface.co/production/uploads/6475c37b04c82116f9bb2356/xpR8PBdnQQ5tFi3N13TUZ.mp4"],"publishedAt":"2025-10-08T21:03:46.000Z","submittedOnDailyAt":"2025-10-09T23:06:45.305Z","title":"TRAVL: A Recipe for Making Video-Language Models Better Judges of\n Physics Implausibility","submittedOnDailyBy":{"_id":"6475c37b04c82116f9bb2356","avatarUrl":"/avatars/6ec34eb3cfd091a38454ac3de72aaddc.svg","isPro":false,"fullname":"saman motamed","user":"sam-motamed","type":"user"},"summary":"Despite impressive visual fidelity, modern video generative models frequently\nproduce sequences that violate intuitive physical laws, such as objects\nfloating, teleporting, or morphing in ways that defy causality. While humans\ncan easily detect such implausibilities, there remains no robust method for\nquantitatively assessing physical realism in video. In this work, we explore\nwhether Video-Language Models (VLMs) can be trained to serve as reliable judges\nof physical plausibility. We find that existing VLMs struggle to identify\nphysics violations, exposing fundamental limitations in their temporal and\ncausal reasoning. To address this, we introduce TRAVL, a fine-tuning recipe\nthat combines a balanced training dataset with a trajectory-aware attention\nmodule to improve motion encoding and discrimination in VLMs. To evaluate\nphysical reasoning more rigorously, we propose ImplausiBench, a benchmark of\n300 videos (150 real, 150 generated) that removes linguistic biases and\nisolates visual-temporal understanding. Performance is reported both with\ngold-standard human judgments and stricter LLM-as-judge metrics. 
Together,\nTRAVL and ImplausiBench offer a unified framework for probing and improving\nphysical plausibility in multimodal models, shedding light on a challenging and\nunderexplored aspect of visual-temporal understanding.","upvotes":3,"discussionId":"68e854aa95e8e6771df3885b","projectPage":"https://sam-motamed.github.io/projects/TRAVL","ai_summary":"TRAVL, a fine-tuning recipe with a trajectory-aware attention module, improves physical plausibility in Video-Language Models using the ImplausiBench benchmark.","ai_keywords":["Video-Language Models","VLMs","TRAVL","trajectory-aware attention","motion encoding","discrimination","ImplausiBench","physical plausibility","visual-temporal understanding","LLM-as-judge"],"organization":{"_id":"6585b575110c3eb77dabaa93","name":"INSAIT-Institute","fullname":"Institute for Computer Science, Artificial intelligence and Technology ","avatar":"https://cdn-uploads.huggingface.co/production/uploads/64f1a0700af832a73d0f3e6f/KwpuATq29U2-Fu55OvUHR.png"}},"canReadDatabase":false,"canManagePapers":false,"canSubmit":false,"hasHfLevelAccess":false,"upvoted":false,"upvoters":[{"_id":"6039478ab3ecf716b1a5fd4d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/6039478ab3ecf716b1a5fd4d/_Thy4E7taiSYBLKxEKJbT.jpeg","isPro":true,"fullname":"taesiri","user":"taesiri","type":"user"},{"_id":"64f1a0700af832a73d0f3e6f","avatarUrl":"/avatars/45a0235c3bbb2d19f0c58fb7d1068b78.svg","isPro":true,"fullname":"Anton Alexandrov","user":"aalexandrov","type":"user"},{"_id":"686db5d4af2b856fabbf13aa","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/no-auth/6BjMv2LVNoqvbX8fQSTPI.png","isPro":false,"fullname":"V bbbb","user":"Bbbbbnnn","type":"user"}],"acceptLanguages":["*"],"dailyPaperRank":0,"organization":{"_id":"6585b575110c3eb77dabaa93","name":"INSAIT-Institute","fullname":"Institute for Computer Science, Artificial intelligence and Technology ","avatar":"https://cdn-uploads.huggingface.co/production/uploads/64f1a0700af832a73d0f3e6f/KwpuATq29U2-Fu55OvUHR.png"}}">
arxiv:2510.07550

TRAVL: A Recipe for Making Video-Language Models Better Judges of Physics Implausibility

Published on Oct 8, 2025
Submitted by saman motamed on Oct 9, 2025
Authors: Saman Motamed, Minghao Chen, Luc Van Gool, Iro Laina

Abstract

AI-generated summary: TRAVL, a fine-tuning recipe with a trajectory-aware attention module, improves how Video-Language Models judge physical plausibility, evaluated on the ImplausiBench benchmark.

Despite impressive visual fidelity, modern video generative models frequently produce sequences that violate intuitive physical laws, such as objects floating, teleporting, or morphing in ways that defy causality. While humans can easily detect such implausibilities, there remains no robust method for quantitatively assessing physical realism in video. In this work, we explore whether Video-Language Models (VLMs) can be trained to serve as reliable judges of physical plausibility. We find that existing VLMs struggle to identify physics violations, exposing fundamental limitations in their temporal and causal reasoning. To address this, we introduce TRAVL, a fine-tuning recipe that combines a balanced training dataset with a trajectory-aware attention module to improve motion encoding and discrimination in VLMs. To evaluate physical reasoning more rigorously, we propose ImplausiBench, a benchmark of 300 videos (150 real, 150 generated) that removes linguistic biases and isolates visual-temporal understanding. Performance is reported both with gold-standard human judgments and stricter LLM-as-judge metrics. Together, TRAVL and ImplausiBench offer a unified framework for probing and improving physical plausibility in multimodal models, shedding light on a challenging and underexplored aspect of visual-temporal understanding.
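The abstract describes the trajectory-aware attention module only at a high level, so the snippet below is a minimal illustrative sketch rather than the paper's implementation. It assumes a self-attention block over video patch tokens whose attention logits receive an additive, learned bias computed from per-token motion descriptors; the class name `TrajectoryAwareAttention`, the `traj_feats` input, and all shapes are hypothetical.

```python
# Minimal illustrative sketch (NOT the paper's implementation): self-attention over
# video patch tokens, with attention logits biased by a learned projection of
# per-token trajectory descriptors. All names and shapes are assumptions.
import torch
import torch.nn as nn


class TrajectoryAwareAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8, traj_dim: int = 4):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        # Maps raw motion descriptors (e.g. x, y, dx, dy per token) to a
        # per-head additive bias on the attention logits.
        self.traj_bias = nn.Linear(traj_dim, num_heads)

    def forward(self, tokens: torch.Tensor, traj_feats: torch.Tensor) -> torch.Tensor:
        # tokens:     (batch, num_tokens, dim)       visual tokens across frames
        # traj_feats: (batch, num_tokens, traj_dim)  motion descriptors per token
        b, n, d = tokens.shape
        q, k, v = self.qkv(tokens).chunk(3, dim=-1)
        q = q.view(b, n, self.num_heads, self.head_dim).transpose(1, 2)
        k = k.view(b, n, self.num_heads, self.head_dim).transpose(1, 2)
        v = v.view(b, n, self.num_heads, self.head_dim).transpose(1, 2)

        attn = (q @ k.transpose(-2, -1)) / self.head_dim ** 0.5   # (b, h, n, n)
        # Additive motion-aware bias: each key token contributes a per-head
        # offset derived from its trajectory descriptor.
        bias = self.traj_bias(traj_feats).permute(0, 2, 1)        # (b, h, n)
        attn = attn + bias.unsqueeze(2)                           # broadcast over queries
        attn = attn.softmax(dim=-1)

        out = (attn @ v).transpose(1, 2).reshape(b, n, d)
        return self.proj(out)


if __name__ == "__main__":
    block = TrajectoryAwareAttention(dim=256, num_heads=8, traj_dim=4)
    x = torch.randn(2, 64, 256)    # e.g. 2 clips, 64 patch tokens each
    traj = torch.randn(2, 64, 4)   # per-token motion descriptors
    print(block(x, traj).shape)    # torch.Size([2, 64, 256])
```

Under this assumption, the learned bias lets tokens with distinctive motion attract or suppress attention, which is one simple way to make the encoder more motion-sensitive; the actual TRAVL module may be designed differently.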

Community

Paper submitter

https://sam-motamed.github.io/projects/TRAVL

Librarian Bot (Bot)

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API:

* Bridging Vision Language Models and Symbolic Grounding for Video Question Answering (2025): https://huggingface.co/papers/2509.11862
* Enhancing Physical Plausibility in Video Generation by Reasoning the Implausibility (2025): https://huggingface.co/papers/2509.24702
* When and What: Diffusion-Grounded VideoLLM with Entity Aware Segmentation for Long Video Understanding (2025): https://huggingface.co/papers/2508.15641
* VidGuard-R1: AI-Generated Video Detection and Explanation via Reasoning MLLMs and RL (2025): https://huggingface.co/papers/2510.02282
* ViMoNet: A Multimodal Vision-Language Framework for Human Behavior Understanding from Motion and Video (2025): https://huggingface.co/papers/2508.09818
* Harnessing Synthetic Preference Data for Enhancing Temporal Understanding of Video-LLMs (2025): https://huggingface.co/papers/2510.03955
* Strefer: Empowering Video LLMs with Space-Time Referring and Reasoning via Synthetic Instruction Data (2025): https://huggingface.co/papers/2509.03501

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any paper on Hugging Face, check out this Space: https://huggingface.co/spaces/librarian-bots/recommend_similar_papers

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend


Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2510.07550 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2510.07550 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2510.07550 in a Space README.md to link it from this page.

Collections including this paper 0

No Collection including this paper

Add this paper to a collection to link it from this page.