arxiv:2506.02863

CapSpeech: Enabling Downstream Applications in Style-Captioned Text-to-Speech

Published on Jun 3, 2025 · Submitted by Helin Wang on Jun 5, 2025
Authors: Helin Wang, Jiarui Hai, Dading Chong, Karan Thakkar, Tiantian Feng, Dongchao Yang, Junhyeok Lee, Laureano Moro Velazquez, Jesus Villalba, Zengyi Qin, Shrikanth Narayanan, Mounya Elhilali, Najim Dehak
Abstract

AI-generated summary: CapSpeech introduces a large benchmark dataset for various captioned text-to-speech tasks, facilitating advancements in style, accent, emotion, and chat-agent synthesis.

Recent advancements in generative artificial intelligence have significantly transformed the field of style-captioned text-to-speech synthesis (CapTTS). However, adapting CapTTS to real-world applications remains challenging due to the lack of standardized, comprehensive datasets and limited research on downstream tasks built upon CapTTS. To address these gaps, we introduce CapSpeech, a new benchmark designed for a series of CapTTS-related tasks, including style-captioned text-to-speech synthesis with sound events (CapTTS-SE), accent-captioned TTS (AccCapTTS), emotion-captioned TTS (EmoCapTTS), and text-to-speech synthesis for chat agent (AgentTTS). CapSpeech comprises over 10 million machine-annotated audio-caption pairs and nearly 0.36 million human-annotated audio-caption pairs. In addition, we introduce two new datasets collected and recorded by a professional voice actor and experienced audio engineers, specifically for the AgentTTS and CapTTS-SE tasks. Alongside the datasets, we conduct comprehensive experiments using both autoregressive and non-autoregressive models on CapSpeech. Our results demonstrate high-fidelity and highly intelligible speech synthesis across a diverse range of speaking styles. To the best of our knowledge, CapSpeech is the largest available dataset offering comprehensive annotations for CapTTS-related tasks. The experiments and findings further provide valuable insights into the challenges of developing CapTTS systems.
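To make the task format concrete: each style-captioned TTS example pairs a transcript with a free-form natural-language caption describing how the speech should sound. The sketch below is purely illustrative; the field names, caption wording, and the sound-event placeholder are assumptions rather than CapSpeech's actual annotation schema.

```python
# Illustrative (hypothetical) style-captioned TTS examples.
# Field names, caption wording, and the <SE> placeholder are assumptions,
# not CapSpeech's actual annotation schema.
captts_example = {
    "transcript": "The quarterly report is ready for review.",
    "style_caption": "A middle-aged male speaker with a calm, low-pitched voice, speaking slowly.",
}

# CapTTS-SE additionally indicates where a sound event should occur in the utterance.
captts_se_example = {
    "transcript": "I heard a knock <SE> and opened the door.",
    "style_caption": "A young female speaker, slightly startled, with a knocking sound effect mid-sentence.",
}

print(captts_example["style_caption"])
print(captts_se_example["transcript"])
```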

Community

We are excited to share our recent work titled "CapSpeech: Enabling Downstream Applications in Style-Captioned Text-to-Speech".

📄 Paper: https://arxiv.org/abs/2506.02863
🌐 Project Page: https://wanghelin1997.github.io/CapSpeech-demo/
🚀 Spaces Demo: https://huggingface.co/spaces/OpenSound/CapSpeech-TTS

Paper author · Paper submitter

CapSpeech is a new benchmark designed for style-captioned TTS (CapTTS) tasks, including style-captioned text-to-speech synthesis with sound effects (CapTTS-SE), accent-captioned TTS (AccCapTTS), emotion-captioned TTS (EmoCapTTS), and text-to-speech synthesis for chat agent (AgentTTS).

CapSpeech comprises over 10 million machine-annotated audio-caption pairs and nearly 0.36 million human-annotated audio-caption pairs. Three new speech datasets are specifically designed for the CapTTS-SE and AgentTTS tasks to enhance the benchmark’s coverage of real-world scenarios.
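For readers who want to inspect the corpora, below is a minimal sketch of loading one of the CapSpeech datasets from the Hugging Face Hub, assuming the data is published under the OpenSound organization; the repository ID, split name, and column layout are assumptions, so check the project page for the actual identifiers.

```python
# Minimal sketch for browsing a CapSpeech corpus with the Hugging Face `datasets` library.
# The repository ID, split name, and column names are assumptions; consult the
# project page (https://wanghelin1997.github.io/CapSpeech-demo/) for the real ones.
from datasets import load_dataset

# Streaming avoids downloading the full corpus up front.
ds = load_dataset("OpenSound/CapSpeech", split="train", streaming=True)  # hypothetical repo ID

for example in ds.take(3):
    # Print whichever text-like fields exist (e.g., transcript and style caption).
    text_fields = {k: v for k, v in example.items() if isinstance(v, str)}
    print(text_fields)
```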

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API:

* Benchmarking Expressive Japanese Character Text-to-Speech with VITS and Style-BERT-VITS2 (2025): https://huggingface.co/papers/2505.17320
* Muyan-TTS: A Trainable Text-to-Speech Model Optimized for Podcast Scenarios with a $50K Budget (2025): https://huggingface.co/papers/2504.19146
* Kimi-Audio Technical Report (2025): https://huggingface.co/papers/2504.18425
* AlignDiT: Multimodal Aligned Diffusion Transformer for Synchronized Speech Generation (2025): https://huggingface.co/papers/2504.20629
* Shallow Flow Matching for Coarse-to-Fine Text-to-Speech Synthesis (2025): https://huggingface.co/papers/2505.12226
* GSA-TTS: Toward Zero-Shot Speech Synthesis based on Gradual Style Adaptor (2025): https://huggingface.co/papers/2505.19384

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any paper on Hugging Face, check out this Space: https://huggingface.co/spaces/librarian-bots/recommend_similar_papers

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend


Models citing this paper 1

Datasets citing this paper 15


Spaces citing this paper 3

Collections including this paper 0
