Towards World Simulator: Crafting Physical Commonsense-Based Benchmark for Video Generation
https://phygenbench123.github.io/

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API:

* [Grounded-VideoLLM: Sharpening Fine-grained Temporal Grounding in Video Large Language Models](https://huggingface.co/papers/2410.03290) (2024)
* [ConceptMix: A Compositional Image Generation Benchmark with Controllable Difficulty](https://huggingface.co/papers/2408.14339) (2024)
* [Q-Bench-Video: Benchmarking the Video Quality Understanding of LMMs](https://huggingface.co/papers/2409.20063) (2024)
* [EditBoard: Towards A Comprehensive Evaluation Benchmark for Text-based Video Editing Models](https://huggingface.co/papers/2409.09668) (2024)
* [EvoChart: A Benchmark and a Self-Training Approach Towards Real-World Chart Understanding](https://huggingface.co/papers/2409.01577) (2024)

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any paper on Hugging Face, check out [this Space](https://huggingface.co/spaces/librarian-bots/recommend_similar_papers).

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: `@librarian-bot recommend`
AI-generated summary

PhyGenBench and PhyGenEval assess text-to-video models' understanding of physical commonsense, revealing shortcomings that cannot be fully addressed by scaling models or prompt engineering.

Abstract
Text-to-video (T2V) models like Sora have made significant strides in
visualizing complex prompts, which is increasingly viewed as a promising path
towards constructing a universal world simulator. Cognitive psychologists
believe that the foundation for achieving this goal is the ability to
understand intuitive physics. However, the capacity of these models to
accurately represent intuitive physics remains largely unexplored. To bridge
this gap, we introduce PhyGenBench, a comprehensive Physics
Generation Benchmark designed to evaluate physical
commonsense correctness in T2V generation. PhyGenBench comprises 160 carefully
crafted prompts across 27 distinct physical laws, spanning four fundamental
domains, enabling a comprehensive assessment of models' understanding of physical
commonsense. Alongside PhyGenBench, we propose a novel evaluation framework
called PhyGenEval. This framework employs a hierarchical evaluation structure
that uses advanced vision-language models and large language models
to assess physical commonsense. Through PhyGenBench and PhyGenEval, we can
conduct large-scale automated assessments of T2V models' understanding of
physical commonsense, yielding results that align closely with human feedback. Our evaluation
results and in-depth analysis demonstrate that current models struggle to
generate videos that comply with physical commonsense. Moreover, simply scaling
up models or employing prompt engineering techniques is insufficient to fully
address the challenges presented by PhyGenBench (e.g., dynamic scenarios). We
hope this study will inspire the community to prioritize the learning of
physical commonsense in these models rather than limiting them to entertainment applications. We will
release the data and code at https://github.com/OpenGVLab/PhyGenBench
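
To make the evaluation idea concrete, below is a minimal, hypothetical sketch of a hierarchical physical-commonsense evaluator in the spirit of PhyGenEval. It is not the released PhyGenBench/PhyGenEval code: the stage breakdown, question format, key-frame sampling, and equal score weighting are illustrative assumptions, and the VLM/LLM backends are passed in as plain callables so no particular model API is assumed.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class BenchmarkPrompt:
    # Hypothetical record; field names are illustrative, not PhyGenBench's schema.
    text: str           # e.g. "A stone is dropped into a glass of water"
    physical_law: str   # e.g. "buoyancy" (one of the benchmark's physical laws)
    domain: str         # e.g. "mechanics" (one of the four fundamental domains)


def evaluate_video(
    prompt: BenchmarkPrompt,
    frames: List[object],                            # decoded video frames
    ask_llm: Callable[[str], str],                   # text -> text (check generation)
    ask_vlm: Callable[[List[object], str], float],   # frames + question -> score in [0, 1]
) -> float:
    """Hierarchical scoring: frame-level checks first, then a video-level
    temporal check, combined into a single physical-commonsense score."""
    # Stage 1 (assumed): an LLM turns the prompt's physical law into concrete
    # yes/no checks that a physically correct video must pass.
    raw_checks = ask_llm(
        f"For the prompt '{prompt.text}', which should obey {prompt.physical_law}, "
        "list short yes/no checks a correct video must satisfy, one per line."
    )
    checks = [c.strip() for c in raw_checks.splitlines() if c.strip()]

    # Stage 2 (assumed): a VLM scores each check against sparsely sampled key frames.
    step = max(1, len(frames) // 4)
    key_frames = frames[::step]
    frame_score = sum(ask_vlm(key_frames, c) for c in checks) / max(1, len(checks))

    # Stage 3 (assumed): a VLM judges overall temporal plausibility on all frames.
    temporal_score = ask_vlm(
        frames, f"Does the motion over time respect {prompt.physical_law}?"
    )

    # Equal weighting is an arbitrary illustrative choice, not the paper's formula.
    return 0.5 * frame_score + 0.5 * temporal_score
```

A driver script would then iterate over the benchmark's prompts, generate one video per prompt with the T2V model under test, call `evaluate_video`, and average the scores per physical law and per domain; comparing a sample of these automated scores against human judgments mirrors the human-alignment check described in the abstract.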