arxiv:2408.02629

VidGen-1M: A Large-Scale Dataset for Text-to-video Generation

Published on Aug 5, 2024 · Submitted by AK on Aug 6, 2024
Authors: Zhiyu Tan, Xiaomeng Yang, Luozheng Qin, Hao Li

AI-generated summary

A new, high-quality dataset called VidGen-1M improves training for text-to-video models by ensuring better video quality, detailed captions, and temporal consistency.

Abstract

The quality of video-text pairs fundamentally determines the upper bound of text-to-video models. Currently, the datasets used for training these models suffer from significant shortcomings, including low temporal consistency, poor-quality captions, substandard video quality, and imbalanced data distribution. The prevailing video curation process, which depends on image models for tagging and manual rule-based curation, leads to a high computational load and leaves behind unclean data. As a result, there is a lack of appropriate training datasets for text-to-video models. To address this problem, we present VidGen-1M, a superior training dataset for text-to-video models. Produced through a coarse-to-fine curation strategy, this dataset guarantees high-quality videos and detailed captions with excellent temporal consistency. When used to train the video generation model, this dataset has led to experimental results that surpass those obtained with other models.
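
The coarse-to-fine idea can be made concrete with a small sketch. The Python snippet below is an illustrative assumption, not the authors' actual pipeline: the `Clip` record, the helper names (`coarse_filter`, `fine_filter`, `balance`), and all scores and thresholds are hypothetical, and a real pipeline would compute the scores with image and video models rather than take them as given.

```python
from dataclasses import dataclass

@dataclass
class Clip:
    path: str
    caption: str
    aesthetic: float             # hypothetical image-model aesthetic score, 0..1
    temporal_consistency: float  # hypothetical inter-frame similarity, 0..1
    category: str                # coarse content tag, e.g. "landscape"

def coarse_filter(clips, min_aesthetic=0.5):
    # Stage 1: cheap, rule-based filtering on precomputed tags and scores.
    return [c for c in clips if c.aesthetic >= min_aesthetic]

def fine_filter(clips, min_consistency=0.8):
    # Stage 2: costlier temporal checks, run only on the survivors.
    return [c for c in clips if c.temporal_consistency >= min_consistency]

def balance(clips, per_category=2):
    # Stage 3: cap each category to flatten the data distribution.
    kept, counts = [], {}
    for c in sorted(clips, key=lambda c: c.aesthetic, reverse=True):
        if counts.get(c.category, 0) < per_category:
            kept.append(c)
            counts[c.category] = counts.get(c.category, 0) + 1
    return kept

if __name__ == "__main__":
    pool = [
        Clip("a.mp4", "a highway through a forest", 0.9, 0.95, "landscape"),
        Clip("b.mp4", "blurry street footage", 0.3, 0.6, "urban"),
        Clip("c.mp4", "a mountain road at dusk", 0.8, 0.9, "landscape"),
        Clip("d.mp4", "city traffic timelapse", 0.7, 0.85, "urban"),
    ]
    curated = balance(fine_filter(coarse_filter(pool)))
    for clip in curated:
        print(clip.path, clip.caption)
```

The point of the staging is cost: cheap rule-based checks prune most of the pool before the expensive temporal and captioning models run, which mirrors how the abstract motivates the coarse-to-fine strategy over prevailing curation processes.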

Community

Project page: https://sais-fuxi.github.io/projects/vidgen-1m/

The video shows a highway winding through a lush green landscape. The road is surrounded by dense trees and vegetation on both sides. The sky is overcast, and the mountains in the distance are partially obscured by clouds. The highway appears to be in good condition, with clear lane markings. There are several vehicles traveling on the road, including cars and trucks. The colors in the video are predominantly green from the trees and grey from the road and sky.
[Image: curving-road-through-lush-green-forest.jpg]

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API:

* OpenVid-1M: A Large-Scale High-Quality Dataset for Text-to-video Generation (2024) https://huggingface.co/papers/2407.02371
* MiraData: A Large-Scale Video Dataset with Long Durations and Structured Captions (2024) https://huggingface.co/papers/2407.06358
* MMTrail: A Multimodal Trailer Video Dataset with Language and Music Descriptions (2024) https://huggingface.co/papers/2407.20962
* SynthVLM: High-Efficiency and High-Quality Synthetic Data for Vision Language Models (2024) https://huggingface.co/papers/2407.20756
* VIMI: Grounding Video Generation through Multi-modal Instruction (2024) https://huggingface.co/papers/2407.06304

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any paper on Hugging Face, check out this Space: https://huggingface.co/spaces/librarian-bots/recommend_similar_papers

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend


Models citing this paper 1

Datasets citing this paper 2
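
For readers who want to inspect the released data, a minimal loading sketch with the Hugging Face `datasets` library follows, assuming the dataset is published in a standard format. The repository id below is a placeholder assumption, not the real repo name; substitute one of the dataset repositories linked above.

```python
from datasets import load_dataset

# Placeholder repository id; replace with one of the dataset repos linked above.
ds = load_dataset("some-org/vidgen-1m", split="train", streaming=True)

# Stream the first few video-caption records instead of downloading the full set.
for i, sample in enumerate(ds):
    print(sample)  # field names depend on the published schema
    if i == 2:
        break
```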

Spaces citing this paper 0


Collections including this paper 7