Bridging the Data Provenance Gap Across Text, Speech and Video
Authors: Shayne Longpre, Nikhil Singh, Manuel Cherep, Kushagra Tiwary, Joanna Materzynska, William Brannon, Robert Mahari, Manan Dey, Mohammed Hamdy, Nayan Saxena, Ahmad Mustafa Anis, Emad A. Alghamdi, Vu Minh Chien, Naana Obeng-Marnu, Da Yin, Kun Qian, Yizhi Li, Minnie Liang, An Dinh, Shrestha Mohanty, Deividas Mataciunas, Tobin South, Jianguo Zhang, Ariel N. Lee, Campbell S. Lund, Christopher Klamm, Damien Sileo, Diganta Misra, Enrico Shippole, Kevin Klyman, Lester JV Miranda, Niklas Muennighoff, Seonghyeon Ye, Seungone Kim, Vipul Gupta, Vivek Sharma, Xuhui Zhou, Caiming Xiong, Luis Villa, Stella Biderman, Alex Pentland, Sara Hooker, Jad Kabbara

Published: 2024-12-19 (paper ID 2412.17847)

Progress in AI is driven largely by the scale and quality of training data.
Despite this, there is a deficit of empirical analysis examining the attributes
of well-established datasets beyond text. In this work we conduct the largest
and first-of-its-kind longitudinal audit across modalities--popular text,
speech, and video datasets--from their detailed sourcing trends and use
restrictions to their geographical and linguistic representation. Our manual
analysis covers nearly 4000 public datasets between 1990-2024, spanning 608
languages, 798 sources, 659 organizations, and 67 countries. We find that
multimodal machine learning applications have overwhelmingly turned to
web-crawled and synthetic data and to social media platforms, such as YouTube,
for their training sets, eclipsing all other sources since 2019. Secondly,
tracing the chain of dataset derivations, we find that while less than 33% of
datasets are restrictively licensed, over 80% of the source content in
widely-used text, speech, and video datasets carries non-commercial
restrictions. Finally, counter
to the rising number of languages and geographies represented in public AI
training datasets, our audit demonstrates that measures of relative geographical
and multilingual representation have failed to significantly improve their
coverage since 2013. We believe the breadth of our audit enables us to
empirically examine trends in data sourcing, restrictions, and
Western-centricity at an ecosystem level, and that visibility into these
questions is essential to
progress in responsible AI. As a contribution to ongoing improvements in
dataset transparency and responsible use, we release our entire multimodal
audit, allowing practitioners to trace data provenance across text, speech, and
video.