Paper page - Qwen3-ASR Technical Report

arxiv:2601.21337

Qwen3-ASR Technical Report

Published on Jan 29 · Submitted by taesiri on Jan 30 · Qwen
Authors: Xian Shi, Xiong Wang, Zhifang Guo, Yongqi Wang, Pei Zhang, Xinyu Zhang, Zishan Guo, Hongkun Hao, Yu Xi, Baosong Yang, Jin Xu, Jingren Zhou, Junyang Lin
Abstract

The Qwen3-ASR family introduces speech recognition models with language identification capabilities and a non-autoregressive forced alignment model, achieving state-of-the-art performance and efficient processing.

AI-generated summary

In this report, we introduce the Qwen3-ASR family, which includes two powerful all-in-one speech recognition models and a novel non-autoregressive speech forced-alignment model. Qwen3-ASR-1.7B and Qwen3-ASR-0.6B are ASR models that support language identification and ASR for 52 languages and dialects. Both leverage large-scale speech training data and the strong audio understanding ability of their foundation model, Qwen3-Omni. We conduct comprehensive internal evaluation in addition to the open-source benchmarks, since ASR models may differ little in open-source benchmark scores yet exhibit significant quality differences in real-world scenarios. The experiments reveal that the 1.7B version achieves SOTA performance among open-source ASR models and is competitive with the strongest proprietary APIs, while the 0.6B version offers the best accuracy-efficiency trade-off. Qwen3-ASR-0.6B achieves an average TTFT as low as 92 ms and transcribes 2000 seconds of speech in 1 second at a concurrency of 128. Qwen3-ForcedAligner-0.6B is an LLM-based NAR timestamp predictor that can align text-speech pairs in 11 languages. Timestamp accuracy experiments show that the proposed model outperforms the three strongest forced-alignment models and offers further advantages in efficiency and versatility. To further accelerate community research on ASR and audio understanding, we release these models under the Apache 2.0 license.
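The abstract's throughput claim can be sanity-checked with simple arithmetic. The sketch below derives the aggregate and per-stream real-time factors implied by "2000 seconds of speech in 1 second at a concurrency of 128"; the even split of work across streams is an assumption for illustration, not a figure from the report.

```python
# Throughput figures implied by the abstract's claim for Qwen3-ASR-0.6B:
# 2000 s of speech transcribed in 1 s of wall time at a concurrency of 128.

audio_seconds = 2000.0  # total speech transcribed (from the abstract)
wall_seconds = 1.0      # elapsed wall-clock time (from the abstract)
concurrency = 128       # number of parallel streams (from the abstract)

# Aggregate real-time factor: how many seconds of audio are processed
# per second of wall time, summed over all streams.
aggregate_rtf = audio_seconds / wall_seconds  # 2000x real time

# Average per-stream real-time factor, assuming the load is split evenly
# across the 128 streams (an assumption; the report gives no per-stream data).
per_stream_rtf = aggregate_rtf / concurrency  # 15.625x real time

# Inverse RTF: wall-clock seconds needed per second of audio.
seconds_per_audio_second = wall_seconds / audio_seconds  # 0.0005 s

print(f"aggregate RTF:  {aggregate_rtf:.0f}x real time")
print(f"per-stream RTF: {per_stream_rtf:.3f}x real time")
print(f"inverse RTF:    {seconds_per_audio_second:.4f} s per audio second")
```

In other words, each of the 128 concurrent streams would still run roughly 15x faster than real time on average, which is consistent with the low 92 ms TTFT also quoted in the abstract.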

Community

Paper submitter

Qwen3-ASR delivers two all-in-one ASR models with 52-language support and a non-autoregressive forced aligner; it achieves competitive state-of-the-art accuracy and fast TTFT, and is released open source under Apache 2.0.

arXivLens breakdown of this paper 👉 https://arxivlens.com/PaperView/Details/qwen3-asr-technical-report-232-4fcaadd0

  • Executive Summary
  • Detailed Breakdown
  • Practical Applications

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API:

  • LLM-ForcedAligner: A Non-Autoregressive and Accurate LLM-Based Forced Aligner for Multilingual and Long-Form Speech (https://huggingface.co/papers/2601.18220) (2026)
  • Index-ASR Technical Report (https://huggingface.co/papers/2601.00890) (2025)
  • Qwen3-TTS Technical Report (https://huggingface.co/papers/2601.15621) (2026)
  • IndexTTS 2.5 Technical Report (https://huggingface.co/papers/2601.03888) (2026)
  • VIBEVOICE-ASR Technical Report (https://huggingface.co/papers/2601.18184) (2026)
  • LEMAS: A 150K-Hour Large-scale Extensible Multilingual Audio Suite with Generative Speech Models (https://huggingface.co/papers/2601.04233) (2026)
  • Hearing to Translate: The Effectiveness of Speech Modality Integration into LLMs (https://huggingface.co/papers/2512.16378) (2025)

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any paper on Hugging Face, check out this Space

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend

Sign up or log in to comment

Models citing this paper 13

Browse 13 models citing this paper

Datasets citing this paper 0

No dataset links to this paper

Cite arxiv.org/abs/2601.21337 in a dataset README.md to link it from this page.

Spaces citing this paper 25

Collections including this paper 8