Paper page - BASE TTS: Lessons from building a billion-parameter Text-to-Speech model on 100K hours of data
Papers
arxiv:2402.08093

BASE TTS: Lessons from building a billion-parameter Text-to-Speech model on 100K hours of data

Published on Feb 12, 2024 · Submitted by AK on Feb 14, 2024 · #1 Paper of the day
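Authors: Mateusz Łajszczak, Guillermo Cámbara, Yang Li, Fatih Beyhan, Arent van Korlaar, Fan Yang, Arnaud Joly, Álvaro Martín-Cortinas, Ammar Abbas, Adam Michalski, Alexis Moinet, Sri Karlapati, Ewa Muszyńska, Haohan Guo, Bartosz Putrycz, Soledad López Gambino, Kayeon Yoo, Elena Sokolova, Thomas Drugman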

Abstract

BASE TTS, a large-scale text-to-speech model, demonstrates advanced naturalness and emergent abilities using autoregressive Transformers and novel speech tokenization techniques.

AI-generated summary

We introduce a text-to-speech (TTS) model called BASE TTS, which stands for Big Adaptive Streamable TTS with Emergent abilities. BASE TTS is the largest TTS model to-date, trained on 100K hours of public domain speech data, achieving a new state-of-the-art in speech naturalness. It deploys a 1-billion-parameter autoregressive Transformer that converts raw texts into discrete codes ("speechcodes") followed by a convolution-based decoder which converts these speechcodes into waveforms in an incremental, streamable manner. Further, our speechcodes are built using a novel speech tokenization technique that features speaker ID disentanglement and compression with byte-pair encoding. Echoing the widely-reported "emergent abilities" of large language models when trained on increasing volume of data, we show that BASE TTS variants built with 10K+ hours and 500M+ parameters begin to demonstrate natural prosody on textually complex sentences. We design and share a specialized dataset to measure these emergent abilities for text-to-speech. We showcase state-of-the-art naturalness of BASE TTS by evaluating against baselines that include publicly available large-scale text-to-speech systems: YourTTS, Bark and TortoiseTTS. Audio samples generated by the model can be heard at https://amazon-ltts-paper.com/.
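
The abstract mentions that the speechcodes are compressed with byte-pair encoding. As a rough illustration of that idea only, the sketch below applies generic BPE to a sequence of integer codes: it repeatedly replaces the most frequent adjacent pair with a new ID, which shortens the sequence the autoregressive model would have to predict. This is not the authors' tokenizer; the function names, the example codes, and the starting ID of 100 are all hypothetical.

```python
from collections import Counter


def most_frequent_pair(seq):
    """Return the most common adjacent pair, or None if no pair occurs twice."""
    counts = Counter(zip(seq, seq[1:]))
    if not counts:
        return None
    pair, freq = counts.most_common(1)[0]
    return pair if freq > 1 else None


def merge_pair(seq, pair, new_id):
    """Replace each non-overlapping occurrence of `pair` with `new_id`."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def bpe_compress(seq, num_merges, first_new_id):
    """Apply up to `num_merges` merges; return the short sequence and merge table."""
    merges = {}  # new_id -> (left, right); hypothetical ID scheme
    for step in range(num_merges):
        pair = most_frequent_pair(seq)
        if pair is None:
            break
        new_id = first_new_id + step
        merges[new_id] = pair
        seq = merge_pair(seq, pair, new_id)
    return seq, merges


def bpe_expand(seq, merges):
    """Undo the merges, newest first, to recover the original codes."""
    for new_id in sorted(merges, reverse=True):
        left, right = merges[new_id]
        expanded = []
        for tok in seq:
            if tok == new_id:
                expanded.extend((left, right))
            else:
                expanded.append(tok)
        seq = expanded
    return seq


if __name__ == "__main__":
    codes = [7, 3, 7, 3, 7, 3, 5, 5]  # made-up speechcode IDs
    compressed, merges = bpe_compress(codes, num_merges=10, first_new_id=100)
    print(compressed, merges)
    print(bpe_expand(compressed, merges) == codes)
```

The merge table makes the compression lossless, so a decoder can expand the merged IDs back into the original codes before synthesizing audio. How BASE TTS actually learns and applies its merges is not described on this page.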

Community

This is incredible! It’s really great with emotions. We need an open sourced implementation!

This is a really cool idea for an implementation. It's really awesome, almost like a hash.

"However, due to the potential misuse of this capability, we have decided against open-sourcing this model as a precautionary measure." I'm really tired of seeing this excuse in speech generation models.

> "However, due to the potential misuse of this capability, we have decided against open-sourcing this model as a precautionary measure." I'm really tired of seeing this excuse in speech generation models.

It's virtue signalling, and can be translated as "F you, this belongs to me", but of course they are saints, and saints can't say such things.

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

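The following papers were recommended by the Semantic Scholar API:

* ZMM-TTS: Zero-shot Multilingual and Multispeaker Speech Synthesis Conditioned on Self-supervised Discrete Speech Representations (2023), https://huggingface.co/papers/2312.14398
* Enhancing the Stability of LLM-based Speech Generation Systems through Self-Supervised Representations (2024), https://huggingface.co/papers/2402.03407
* Utilizing Neural Transducers for Two-Stage Text-to-Speech via Semantic Token Prediction (2024), https://huggingface.co/papers/2401.01498
* SpiRit-LM: Interleaved Spoken and Written Language Model (2024), https://huggingface.co/papers/2402.05755
* Unified Speech-Text Pretraining for Spoken Dialog Modeling (2024), https://huggingface.co/papers/2402.05706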

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any paper on Hugging Face, check out this Space: https://huggingface.co/spaces/librarian-bots/recommend_similar_papers

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend

Is there any word on when this will make its way to AWS?

Amazon cannot get enough of everybody's money... They never help the community or release anything of value.
