Paper page - NextFlow: Unified Sequential Modeling Activates Multimodal Understanding and Generation
Comments

librarian-bot (Librarian Bot), Jan 7, 2026:

This is an automated message from the [Librarian Bot](https://huggingface.co/librarian-bots). I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API:

* [MammothModa2: A Unified AR-Diffusion Framework for Multimodal Understanding and Generation](https://huggingface.co/papers/2511.18262) (2025)
* [STAR: STacked AutoRegressive Scheme for Unified Multimodal Learning](https://huggingface.co/papers/2512.13752) (2025)
* [Visual Generation Tuning](https://huggingface.co/papers/2511.23469) (2025)
* [Loom: Diffusion-Transformer for Interleaved Generation](https://huggingface.co/papers/2512.18254) (2025)
* [What Happens Next? Next Scene Prediction with a Unified Video Model](https://huggingface.co/papers/2512.13015) (2025)
* [Both Semantics and Reconstruction Matter: Making Representation Encoders Ready for Text-to-Image Generation and Editing](https://huggingface.co/papers/2512.17909) (2025)
* [VQRAE: Representation Quantization Autoencoders for Multimodal Understanding, Generation and Reconstruction](https://huggingface.co/papers/2511.23386) (2025)

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any paper on Hugging Face, check out [this Space](https://huggingface.co/spaces/librarian-bots/recommend_similar_papers).

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: `@librarian-bot recommend`
avahal (Avi), Jan 17, 2026:

arXivlens breakdown of this paper 👉 https://arxivlens.com/PaperView/Details/nextflow-unified-sequential-modeling-activates-multimodal-understanding-and-generation-8305-bfb5e4d0

- Executive Summary
- Detailed Breakdown
- Practical Applications
NextFlow: Unified Sequential Modeling Activates Multimodal Understanding and Generation

Published on Jan 5, 2026 · Submitted to Daily Papers on Jan 6, 2026 by Liao Qu (leo1117)

Authors: Huichao Zhang, Liao Qu, Yiheng Liu, Hang Chen, Yangyang Song, Yongsheng Dong, Shikun Sun, Xian Li, Xu Wang, Yi Jiang, Hu Ye, Bo Chen, Yiming Gao, Peng Liu, Akide Liu, Zhipeng Yang, Qili Deng, Linjie Xing, Jiyang Liu, Zhao Wang, Yang Zhou, Mingcong Liu, Yi Zhang, Qian He, Xiwei Hu, Zhongqi Qi, Jie Shao, Zhiye Fu, Shuai Wang, Fangmin Chen, Xuezhi Chai, Zhihua Wu, Yitong Wang, Zehuan Yuan, Daniel K. Du, Xinglong Wu

Organization: ByteDance
GitHub: https://github.com/ByteVisionLab/NextFlow (310 stars)
Upvotes: 62
AI keywords: decoder-only autoregressive transformer, interleaved text-image discrete tokens, unified vision representation, multimodal understanding, multimodal generation, next-token prediction, next-scale prediction, raster-scan methods, visual generation, prefix-tuning strategy, reinforcement learning, diffusion baselines
AI-generated summary

NextFlow is a unified decoder-only autoregressive transformer that processes interleaved text-image tokens, enabling fast multimodal generation through novel next-token and next-scale prediction strategies.

Abstract
We present NextFlow, a unified decoder-only autoregressive transformer trained on 6 trillion interleaved text-image discrete tokens. By leveraging a unified vision representation within a unified autoregressive architecture, NextFlow natively activates multimodal understanding and generation capabilities, unlocking image editing, interleaved content generation, and video generation. Motivated by the distinct nature of the two modalities, where text is strictly sequential and images are inherently hierarchical, we retain next-token prediction for text but adopt next-scale prediction for visual generation. This departs from traditional raster-scan methods and enables the generation of 1024×1024 images in just 5 seconds, orders of magnitude faster than comparable AR models. We address the instabilities of multi-scale generation through a robust training recipe. Furthermore, we introduce a prefix-tuning strategy for reinforcement learning. Experiments demonstrate that NextFlow achieves state-of-the-art performance among unified models and rivals specialized diffusion baselines in visual quality.
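To make the speed argument concrete, the sketch below counts the sequential decoding steps needed for the visual tokens of one image under raster-scan next-token prediction versus next-scale prediction, where each scale is emitted in a single autoregressive step and its tokens are decoded in parallel. The 64×64 latent grid and the specific scale schedule are illustrative assumptions, not values taken from the paper.

```python
# Back-of-the-envelope comparison of sequential decoding steps for
# raster-scan next-token prediction vs. next-scale prediction.
# NOTE: the 64x64 latent token grid and the scale schedule below are
# illustrative assumptions, not values reported in the NextFlow paper.

GRID = 64  # hypothetical token grid for a 1024x1024 image (16x downsampling)

# Raster-scan AR: every visual token costs one sequential decoding step.
raster_steps = GRID * GRID

# Next-scale AR: one sequential step per scale; all tokens within a scale
# are predicted in parallel, conditioned on the coarser scales before it.
scale_schedule = [1, 2, 3, 4, 6, 9, 13, 18, 24, 32, 48, 64]  # assumed side lengths
next_scale_steps = len(scale_schedule)
total_tokens = sum(side * side for side in scale_schedule)

print(f"raster-scan sequential steps: {raster_steps}")      # 4096
print(f"next-scale sequential steps:  {next_scale_steps}")  # 12
print(f"tokens emitted across scales: {total_tokens}")
```

Under these assumptions, the sequential depth drops from thousands of steps to roughly a dozen, which is the structural reason a next-scale model can emit a 1024×1024 image in seconds while raster-scan AR models cannot.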