Paper page - MOSS Transcribe Diarize: Accurate Transcription with Speaker Diarization
Zhaoye Fei (paper author), replying to Elliott Dyson's comment on Jan 10, 2026:

Thank you so much; we really appreciate the kind words and your interest.
We're planning to release an open-weight model in the coming months.
Also, thanks a lot for the heads-up about the website; we're on it.

Zhaoye Fei (paper author), on Jan 21, 2026:

@ElliottDysonDesigns Hi, we've recently launched an API ( https://studio.mosi.cn/docs/moss-transcribe-diarize ) for integration and usage. The open-source version is coming soon, so stay tuned!

arxiv:2601.01554

MOSS Transcribe Diarize: Accurate Transcription with Speaker Diarization

Published on Jan 4, 2026 · Submitted by Zhaoye Fei on Jan 7, 2026 · #3 Paper of the day
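Authors: MOSI. AI, Donghua Yu, Zhengyuan Lin, Chen Yang, Yiyang Zhang, Hanfu Chen, Jingqi Chen, Ke Chen, Liwei Fan, Yi Jiang, Jie Zhu, Muchen Li, Wenxuan Wang, Yang Wang, Zhe Xu, Yitian Gong, Yuqian Zhang, Wenbo Zhang, Zhaoye Fei, Qinyuan Cheng, Shimin Li, Xipeng Qiu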

Abstract

AI-generated summary: A unified multimodal large language model for end-to-end speaker-attributed, time-stamped transcription with extended context window and strong generalization across benchmarks.

Speaker-Attributed, Time-Stamped Transcription (SATS) aims to transcribe what is said and to precisely determine the timing of each speaker, which is particularly valuable for meeting transcription. Existing SATS systems rarely adopt an end-to-end formulation and are further constrained by limited context windows, weak long-range speaker memory, and the inability to output timestamps. To address these limitations, we present MOSS Transcribe Diarize, a unified multimodal large language model that jointly performs Speaker-Attributed, Time-Stamped Transcription in an end-to-end paradigm. Trained on extensive real in-the-wild data and equipped with a 128k context window for up to 90-minute inputs, MOSS Transcribe Diarize scales well and generalizes robustly. Across comprehensive evaluations, it outperforms state-of-the-art commercial systems on multiple public and in-house benchmarks.
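
As an illustration of what speaker-attributed, time-stamped output might look like once it reaches downstream code, here is a minimal Python sketch. The `Segment` schema, its field names, and the helper functions are hypothetical; they are not taken from the paper or from the MOSS Transcribe Diarize API, and the model's actual output format may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """One speaker-attributed, time-stamped utterance (hypothetical schema)."""
    speaker: str  # speaker label, e.g. "S1"
    start: float  # start time in seconds
    end: float    # end time in seconds
    text: str     # transcribed content


def format_timestamp(seconds: float) -> str:
    """Render a time offset in seconds as HH:MM:SS.mmm."""
    millis = round(seconds * 1000)
    hours, rem = divmod(millis, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"


def render_transcript(segments: list[Segment]) -> str:
    """Sort segments by time and print one line per utterance."""
    ordered = sorted(segments, key=lambda s: (s.start, s.end))
    return "\n".join(
        f"[{format_timestamp(s.start)} - {format_timestamp(s.end)}] "
        f"{s.speaker}: {s.text}"
        for s in ordered
    )


def speaking_time(segments: list[Segment]) -> dict[str, float]:
    """Total seconds attributed to each speaker label."""
    totals: dict[str, float] = {}
    for s in segments:
        totals[s.speaker] = totals.get(s.speaker, 0.0) + (s.end - s.start)
    return totals
```

Keeping speaker, timing, and text in one record is what makes per-speaker statistics such as `speaking_time` a one-pass computation; a pipeline that produces transcripts and diarization separately would first have to align the two.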

Community

Zhaoye Fei (paper author, paper submitter)

MOSS Transcribe Diarize 🎙️

We introduce MOSS Transcribe Diarize — a unified multimodal model for Speaker-Attributed, Time-Stamped Transcription (SATS).

🔍 End-to-end SATS in a single pass (transcription + speaker attribution + timestamps)
🧠 128k context window for up to ~90-minute audio without chunking (strong long-range speaker memory)
🌍 Trained on extensive in-the-wild conversations + controllable simulated mixtures (robust to overlap/noise/domain shift)
📊 Strong results on AISHELL-4 / Podcast / Movies benchmarks (best cpCER / Δcp among evaluated systems)
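
The cpCER metric named in the last bullet is commonly understood as the concatenated minimum-permutation character error rate: each speaker's utterances are concatenated, and hypothesis speakers are matched to reference speakers by whichever permutation gives the fewest character edits. The sketch below assumes that definition. It is not the paper's evaluation code, the function names are illustrative, and text normalization (whitespace, punctuation) is left out.

```python
from itertools import permutations


def edit_distance(ref: str, hyp: str) -> int:
    """Character-level Levenshtein distance between two strings."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cp_cer(ref: dict[str, list[str]], hyp: dict[str, list[str]]) -> float:
    """cpCER sketch: best-permutation character errors / reference length.

    `ref` and `hyp` map speaker labels to that speaker's utterances in
    time order. Labels need not agree between the two sides.
    """
    ref_streams = ["".join(utts) for utts in ref.values()]
    hyp_streams = ["".join(utts) for utts in hyp.values()]

    # Pad with empty streams so both sides have the same speaker count;
    # an unmatched speaker is then charged for all of its characters.
    n = max(len(ref_streams), len(hyp_streams))
    ref_streams += [""] * (n - len(ref_streams))
    hyp_streams += [""] * (n - len(hyp_streams))

    total_ref_chars = sum(len(s) for s in ref_streams)
    if total_ref_chars == 0:
        raise ValueError("reference transcript is empty")

    # Brute force over speaker permutations: fine for a handful of
    # speakers, factorial in general.
    best_errors = min(
        sum(edit_distance(r, h) for r, h in zip(ref_streams, perm))
        for perm in permutations(hyp_streams)
    )
    return best_errors / total_ref_chars
```

Because the score is minimized over speaker mappings, arbitrary hypothesis labels are not penalized, but text attributed to the wrong speaker is. For recordings with many speakers, a linear assignment solver would scale better than enumerating permutations.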

Paper: [2601.01554] MOSS Transcribe Diarize: Accurate Transcription with Speaker Diarization
Homepage: https://mosi.cn/models/moss-transcribe-diarize
Online Demo: https://moss-transcribe-diarize-demo.mosi.cn

Some very interesting work. Very much hoping for at least an open-weight release 🤞.

P.s. In case you were not already aware, your website is currently down.

Very cool & useful work! Is the model going to be open source?

Zhaoye Fei (paper author):

Thank you for your interest, and we plan to open-source it in the coming months.

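This is an automated message from the Librarian Bot (https://huggingface.co/librarian-bots). I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API:

* Train Short, Infer Long: Speech-LLM Enables Zero-Shot Streamable Joint ASR and Diarization on Long Audio (2025), https://huggingface.co/papers/2511.16046
* JoyVoice: Long-Context Conditioning for Anthropomorphic Multi-Speaker Conversational Synthesis (2025), https://huggingface.co/papers/2512.19090
* Context-Aware Whisper for Arabic ASR Under Linguistic Varieties (2025), https://huggingface.co/papers/2511.18774
* Toward Conversational Hungarian Speech Recognition: Introducing the BEA-Large and BEA-Dialogue Datasets (2025), https://huggingface.co/papers/2511.13529
* VALLR-Pin: Uncertainty-Factorized Visual Speech Recognition for Mandarin with Pinyin Guidance (2025), https://huggingface.co/papers/2512.20032
* VocalBench-zh: Decomposing and Benchmarking the Speech Conversational Abilities in Mandarin Context (2025), https://huggingface.co/papers/2511.08230
* RosettaSpeech: Zero-Shot Speech-to-Speech Translation from Monolingual Data (2025), https://huggingface.co/papers/2511.20974

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any paper on Hugging Face, check out this Space: https://huggingface.co/spaces/librarian-bots/recommend_similar_papers

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend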

Really useful work, I'm excited to see the open-weight release!
