
arxiv:2403.03100

NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models

Published on Mar 5, 2024 · Submitted by AK on Mar 6, 2024 · #1 Paper of the day
Authors:
Zeqian Ju, Yuancheng Wang, Kai Shen, Xu Tan, Detai Xin, Dongchao Yang, Yanqing Liu, Yichong Leng, Kaitao Song, Siliang Tang, Zhizheng Wu, Tao Qin, Xiang-Yang Li, Wei Ye, Shikun Zhang, Jiang Bian, Lei He, Jinyu Li, Sheng Zhao

Abstract

NaturalSpeech 3, using factorized diffusion models, outperforms other TTS systems by generating disentangled speech attributes in a zero-shot manner.

AI-generated summary

While recent large-scale text-to-speech (TTS) models have achieved significant progress, they still fall short in speech quality, similarity, and prosody. Since speech intricately encompasses various attributes (e.g., content, prosody, timbre, and acoustic details) that pose significant challenges for generation, a natural idea is to factorize speech into individual subspaces representing different attributes and generate each of them individually. Motivated by this, we propose NaturalSpeech 3, a TTS system with novel factorized diffusion models to generate natural speech in a zero-shot way. Specifically, 1) we design a neural codec with factorized vector quantization (FVQ) to disentangle the speech waveform into subspaces of content, prosody, timbre, and acoustic details; 2) we propose a factorized diffusion model to generate the attributes in each subspace following its corresponding prompt. With this factorization design, NaturalSpeech 3 can effectively and efficiently model intricate speech with disentangled subspaces in a divide-and-conquer way. Experiments show that NaturalSpeech 3 outperforms state-of-the-art TTS systems on quality, similarity, prosody, and intelligibility. Furthermore, we achieve better performance by scaling to 1B parameters and 200K hours of training data.
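As a concrete illustration of point (1), here is a minimal PyTorch sketch of factorized vector quantization: a shared encoder latent is projected into four attribute subspaces, each quantized against its own codebook, yielding one discrete token stream per attribute. Everything here (module names, dimensions, and the straight-through trick as the only training machinery) is a hypothetical simplification for intuition, not the paper's actual FACodec.

```python
import torch
import torch.nn as nn


class SubspaceQuantizer(nn.Module):
    """One VQ codebook over a learned projection of the shared latent."""

    def __init__(self, latent_dim: int, code_dim: int, codebook_size: int):
        super().__init__()
        self.proj = nn.Linear(latent_dim, code_dim)
        self.codebook = nn.Embedding(codebook_size, code_dim)

    def forward(self, z):
        h = self.proj(z)                                   # (B, T, code_dim)
        # Nearest-codeword lookup (flatten batch/time for a 2-D distance matrix).
        dists = torch.cdist(h.flatten(0, 1), self.codebook.weight)
        codes = dists.argmin(dim=-1).view(h.shape[:-1])    # (B, T) token ids
        q = self.codebook(codes)
        # Straight-through estimator: gradients bypass the discrete argmin.
        return h + (q - h).detach(), codes


class FactorizedVQ(nn.Module):
    """Quantize a shared latent into attribute-specific token streams."""

    def __init__(self, latent_dim=256, code_dim=64, codebook_size=1024):
        super().__init__()
        self.subspaces = nn.ModuleDict({
            name: SubspaceQuantizer(latent_dim, code_dim, codebook_size)
            for name in ("content", "prosody", "timbre", "acoustic")
        })

    def forward(self, z):
        # Each token stream would then be generated by its own diffusion
        # model, conditioned on the corresponding attribute of the prompt.
        return {name: vq(z)[1] for name, vq in self.subspaces.items()}


latents = torch.randn(2, 100, 256)   # (batch, frames, latent_dim) from an encoder
tokens = FactorizedVQ()(latents)
print({k: v.shape for k, v in tokens.items()})  # four (2, 100) token streams
```

Note that parallel codebooks alone would not disentangle the attributes; the factorization has to be enforced with attribute-specific supervision or constraints during codec training.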

Community

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any paper on Hugging Face, check out this Space

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend

This work looks very promising. Do you believe it may be possible, with appropriate transcriptions in the dataset, to embed control over tone of voice/emotion, as has been shown to be possible with models such as Bark or Tortoise? Or would the lack of transformer encoders for any aspect other than 'timbre extraction' (as worded in the paper) make this unlikely?

good effort


Models citing this paper (3)

Datasets citing this paper (0)

No datasets link this paper yet.

Cite arxiv.org/abs/2403.03100 in a dataset README.md to link it from this page.
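For example, a dataset README.md could include a line like the following (the wording and section header are illustrative; per the note above, any mention of the paper's arXiv URL creates the link):

```markdown
## Citation
This dataset accompanies [NaturalSpeech 3](https://arxiv.org/abs/2403.03100).
```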

Spaces citing this paper (35)

Collections including this paper (13)