Paper page - Qwen3-TTS Technical Report

arxiv:2601.15621

Qwen3-TTS Technical Report

Published on Jan 22 · Submitted by taesiri on Jan 23 · Qwen
Authors: Hangrui Hu, Xinfa Zhu, Ting He, Dake Guo, Bin Zhang, Xiong Wang, Zhifang Guo, Ziyue Jiang, Hongkun Hao, Zishan Guo, Xinyu Zhang, Pei Zhang, Baosong Yang, Jin Xu, Jingren Zhou, Junyang Lin

Abstract

In this report, we present the Qwen3-TTS series, a family of advanced multilingual, controllable, robust, and streaming text-to-speech models. Qwen3-TTS supports state-of-the-art 3-second voice cloning and description-based control, allowing both the creation of entirely novel voices and fine-grained manipulation of the output speech. Trained on over 5 million hours of speech data spanning 10 languages, Qwen3-TTS adopts a dual-track LM architecture for real-time synthesis, coupled with two speech tokenizers: 1) Qwen-TTS-Tokenizer-25Hz is a single-codebook codec emphasizing semantic content, which offers seamless integration with Qwen-Audio and enables streaming waveform reconstruction via a block-wise DiT. 2) Qwen-TTS-Tokenizer-12Hz achieves extreme bitrate reduction and ultra-low-latency streaming, enabling immediate first-packet emission (97 ms) through its 12.5 Hz, 16-layer multi-codebook design and a lightweight causal ConvNet. Extensive experiments indicate state-of-the-art performance across diverse objective and subjective benchmarks (e.g., TTS multilingual test set, InstructTTSEval, and our long speech test set). To facilitate community research and development, we release both tokenizers and models under the Apache 2.0 license.

AI-generated summary

The Qwen3-TTS series presents advanced multilingual text-to-speech models with voice cloning and controllable speech generation capabilities, utilizing a dual-track LM architecture and specialized speech tokenizers for efficient streaming synthesis.
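As a rough illustration of the tokenizer figures quoted in the abstract, the following back-of-the-envelope sketch computes token rates and an approximate bitrate. The frame rates (25 Hz; 12.5 Hz with 16 codebooks) come from the abstract; the 1024-entry codebook size is a hypothetical assumption for illustration, not a value from the report.

```python
import math

def token_rate(frame_hz: float, n_codebooks: int) -> float:
    """Discrete tokens emitted per second of audio."""
    return frame_hz * n_codebooks

def bitrate_bps(frame_hz: float, n_codebooks: int, codebook_size: int) -> float:
    """Bits per second, assuming log2(codebook_size) bits per token."""
    return frame_hz * n_codebooks * math.log2(codebook_size)

# Qwen-TTS-Tokenizer-25Hz: single codebook at 25 Hz
r25 = token_rate(25.0, 1)    # 25 tokens/s

# Qwen-TTS-Tokenizer-12Hz: 16 codebooks at a 12.5 Hz frame rate
r12 = token_rate(12.5, 16)   # 200 tokens/s

# With a HYPOTHETICAL 1024-entry codebook (10 bits per token):
bps = bitrate_bps(12.5, 16, 1024)  # 2000 bits/s

print(r25, r12, bps)
```

At these rates, even the assumed 2 kbit/s figure is far below conventional audio codec bitrates, which is consistent with the abstract's claim of extreme bitrate reduction.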

Community

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any Paper on Hugging Face, check out this Space

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend


Models citing this paper 62


Datasets citing this paper 0

No datasets link this paper

Cite arxiv.org/abs/2601.15621 in a dataset README.md to link it from this page.

Spaces citing this paper 675

Collections including this paper 14