Liberating LLM Capabilities in Full-Duplex Speech Models
Abstract
A text-first tri-channel speech interface enables real-time interaction with visible text output alongside spoken responses, demonstrating strong performance on full-duplex conversational benchmarks.
Speech-based large language models are typically constrained to spoken replies. This limits their user-facing outputs to what can be verbalized and suppresses text-native capabilities such as code generation, structured analysis, and multi-step reasoning during realtime interaction, especially in tasks that require persistent, structured, and inspectable intermediate outputs. Existing work improves spoken reasoning or full-duplex turn-taking, but still treats text as a hidden intermediate state or a subordinate modality rather than a first-class output channel. We propose Listen-Write-Speak (LWS), a text-first tri-channel paradigm in which a single autoregressive LLM continuously listens to user audio, writes visible free-form text as its primary output, and speaks a realtime oral response in parallel under a shared causal attention context. This behavior is implemented entirely through a Token Schema, requiring no architectural modifications, and is learned via a two-stage data pipeline that synthesizes per-second cognitive annotations consistent with the revealed input timeline. Empirically, LWS demonstrates strong full-duplex interaction on Full-Duplex-Bench, reaches 4.72 on VoiceBench AlpacaEval, achieves 92.6% writing-speaking consistency, and consistently outperforms its internal ablations on URO-Bench. These results suggest that visible writing can serve as a first-class output channel for speech interaction without sacrificing realtime responsiveness. The code and dataset are available on the project page: https://royalzhang.com/project/lws-page/.
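The abstract does not spell out the Token Schema, so the following is only a minimal sketch of the general idea: three channels (listen, write, speak) flattened into one token sequence that a single autoregressive model could attend over causally. The channel markers `<listen>`, `<write>`, `<speak>`, the `Step` class, and the functions `interleave` and `split_channels` are all hypothetical names chosen for illustration, and the per-step grouping is an assumption loosely based on the "per-second" annotations mentioned above. It should not be read as the authors' actual schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical channel markers; the real LWS Token Schema is not given in the abstract.
LISTEN, WRITE, SPEAK = "<listen>", "<write>", "<speak>"


@dataclass
class Step:
    """Tokens for one time step (assumed here to be one second of interaction)."""
    heard: List[str] = field(default_factory=list)    # user audio tokens
    written: List[str] = field(default_factory=list)  # visible text tokens
    spoken: List[str] = field(default_factory=list)   # model speech tokens


def interleave(steps: List[Step]) -> List[str]:
    """Flatten per-step channels into a single token sequence.

    Each step contributes its channels in a fixed order, so every token
    can only depend on tokens that appear earlier in the sequence.
    """
    seq: List[str] = []
    for step in steps:
        for marker, tokens in ((LISTEN, step.heard),
                               (WRITE, step.written),
                               (SPEAK, step.spoken)):
            seq.append(marker)
            seq.extend(tokens)
    return seq


def split_channels(seq: List[str]) -> Dict[str, List[str]]:
    """Recover the three channel streams from the flattened sequence."""
    channels: Dict[str, List[str]] = {LISTEN: [], WRITE: [], SPEAK: []}
    current = None
    for tok in seq:
        if tok in channels:
            current = tok          # a marker switches the active channel
        elif current is not None:
            channels[current].append(tok)
    return channels


# Toy example with placeholder tokens.
steps = [
    Step(heard=["a1", "a2"], written=["def"], spoken=["s1"]),
    Step(heard=["a3"], written=["add", "("], spoken=["s2", "s3"]),
]
sequence = interleave(steps)
channels = split_channels(sequence)
```

In this sketch the written stream can be shown to the user as it grows while the spoken stream is sent to a vocoder, and both come from the same flattened sequence. Because the channels are distinguished only by tokens, nothing about the model architecture has to change, which is the property the abstract attributes to the Token Schema.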
Community
LWS is a simple “free lunch” for full-duplex speech models: without changing the model architecture, we add a visible writing channel through a token schema, allowing the model to speak in real time while also producing text-native outputs such as code, tables, derivations, and structured reasoning.
Project page: https://royalzhang.com/project/lws-page/