Paper page - PhysGame: Uncovering Physical Commonsense Violations in Gameplay Videos

arxiv:2412.01800

PhysGame: Uncovering Physical Commonsense Violations in Gameplay Videos

Published on Dec 2, 2024
· Submitted by Ge Zhang on Dec 3, 2024
Authors: Meng Cao, Haoran Tang, Haoze Zhao, Hangyu Guo, Jiaheng Liu, Ge Zhang, Ruyang Liu, Qiang Sun, Ian Reid, Xiaodan Liang

Abstract

The PhysGame benchmark, together with the PhysInstruct instruction-tuning and PhysDPO preference-optimization datasets, enhances video LLMs' physical commonsense understanding; the resulting PhysVLM model outperforms existing video LLMs.

AI-generated summary

Recent advancements in video-based large language models (Video LLMs) have witnessed the emergence of diverse capabilities to reason over and interpret dynamic visual content. Among them, gameplay videos stand out as a distinctive data source, often containing glitches that defy physical commonsense. This characteristic renders them an effective benchmark for assessing the under-explored capability of physical commonsense understanding in video LLMs. In this paper, we propose PhysGame as a pioneering benchmark to evaluate physical commonsense violations in gameplay videos. PhysGame comprises 880 videos associated with glitches spanning four fundamental domains (i.e., mechanics, kinematics, optics, and material properties) and 12 distinct categories of physical commonsense. Through extensive evaluation of various state-of-the-art video LLMs, our findings reveal that the performance of current open-source video LLMs significantly lags behind that of proprietary counterparts. To bridge this gap, we curate an instruction-tuning dataset, PhysInstruct, with 140,057 question-answering pairs to facilitate physical commonsense learning. In addition, we also propose a preference optimization dataset, PhysDPO, with 34,358 training pairs, where the dis-preferred responses are generated conditioned on misleading titles (i.e., meta-information hacking), fewer frames (i.e., temporal hacking), and lower spatial resolutions (i.e., spatial hacking). Based on this suite of datasets, we propose PhysVLM as a physical knowledge-enhanced video LLM. Extensive experiments on both the physics-oriented benchmark PhysGame and general video understanding benchmarks demonstrate the state-of-the-art performance of PhysVLM.
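
As a rough illustration of the three degradation strategies named above, the sketch below shows one way PhysDPO-style dis-preferred responses could be assembled. It is not the authors' released pipeline: the `model.generate(frames, prompt)` interface, the frame count, and the target resolution are assumptions made only for this example.

```python
# Hedged sketch of building preference pairs via meta-information, temporal,
# and spatial "hacking" (illustrative only; interfaces below are hypothetical).
from dataclasses import dataclass
from typing import List

from PIL import Image


@dataclass
class PreferencePair:
    prompt: str
    chosen: str    # response conditioned on the full video context
    rejected: str  # response conditioned on degraded / misleading context


def temporal_hack(frames: List[Image.Image], keep: int = 4) -> List[Image.Image]:
    """Keep only a few uniformly spaced frames (temporal hacking)."""
    step = max(len(frames) // keep, 1)
    return frames[::step][:keep]


def spatial_hack(frames: List[Image.Image], size=(64, 64)) -> List[Image.Image]:
    """Aggressively downsample every frame (spatial hacking)."""
    return [frame.resize(size) for frame in frames]


def build_pair(model, frames, question, misleading_title) -> PreferencePair:
    """Generate a chosen/rejected pair; `model.generate` is a hypothetical API."""
    chosen = model.generate(frames, question)
    # Meta-information hacking: prepend a misleading title and feed degraded
    # visual context so the rejected answer tends to miss the actual glitch.
    degraded = spatial_hack(temporal_hack(frames))
    rejected = model.generate(degraded, f"Video title: {misleading_title}\n{question}")
    return PreferencePair(prompt=question, chosen=chosen, rejected=rejected)
```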

Community

Paper author Paper submitter

The First Evaluation Benchmark for Physical Commonsense Understanding Based on Gameplay Videos

  1. Based on "glitch phenomena" that violate physical commonsense in gameplay videos, we constructed the PhysGame benchmark to evaluate the physical commonsense understanding of current multimodal large language models.
  2. We developed the PhysInstruct-140K and PhysDPO-34K datasets for supervised fine-tuning (SFT) and direct preference optimization (DPO) training, respectively.
  3. We introduced a strong baseline model that achieves state-of-the-art performance on both the PhysGame benchmark and general video understanding datasets.
  4. The paper, code, and datasets are open-sourced (a data-loading sketch follows after this list).
    Preprint: https://arxiv.org/abs/2412.01800
    Code: https://github.com/PhysGame/PhysGame
    Data: https://huggingface.co/PhysGame
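
For convenience, here is a minimal sketch of pulling the released resources from the PhysGame organization on the Hugging Face Hub. The dataset repository id passed to `load_dataset` is an assumption; check https://huggingface.co/PhysGame for the actual repository names.

```python
# Minimal sketch, assuming the data is hosted under the "PhysGame" org on the Hub.
from huggingface_hub import list_datasets
from datasets import load_dataset

# Enumerate the datasets published by the PhysGame organization.
for info in list_datasets(author="PhysGame"):
    print(info.id)

# Load a specific dataset once its repository id is known (id below is hypothetical).
benchmark = load_dataset("PhysGame/PhysGame-Benchmark")
print(benchmark)
```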

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API:

* VidComposition: Can MLLMs Analyze Compositions in Compiled Videos? (2024) https://huggingface.co/papers/2411.10979
* Towards World Simulator: Crafting Physical Commonsense-Based Benchmark for Video Generation (2024) https://huggingface.co/papers/2410.05363
* VCBench: A Controllable Benchmark for Symbolic and Abstract Challenges in Video Cognition (2024) https://huggingface.co/papers/2411.09105
* VideoSAVi: Self-Aligned Video Language Models without Human Supervision (2024) https://huggingface.co/papers/2412.00624
* TemporalBench: Benchmarking Fine-grained Temporal Understanding for Multimodal Video Models (2024) https://huggingface.co/papers/2410.10818
* VidCompress: Memory-Enhanced Temporal Compression for Video Understanding in Large Language Models (2024) https://huggingface.co/papers/2410.11417
* On the Consistency of Video Large Language Models in Temporal Comprehension (2024) https://huggingface.co/papers/2411.12951

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any Paper on Hugging Face, check out this Space: https://huggingface.co/spaces/librarian-bots/recommend_similar_papers

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend


Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2412.01800 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2412.01800 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2412.01800 in a Space README.md to link it from this page.

Collections including this paper 3