Thinking with Drafting: Optical Decompression via Logical Reconstruction
\n","updatedAt":"2026-02-13T08:15:19.649Z","author":{"_id":"640f7083208821a59b74c757","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1678735253848-640f7083208821a59b74c757.jpeg","fullname":"Siyuan Li","name":"Lupin1998","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":10,"isUserFollowing":false}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.633520245552063},"editors":["Lupin1998"],"editorAvatarUrls":["https://cdn-avatars.huggingface.co/v1/production/uploads/1678735253848-640f7083208821a59b74c757.jpeg"],"reactions":[{"reaction":"👀","users":["Lupin1998","chengtan9907","liaao","Dracozzz","Pipicat98","Jerry-98"],"count":6},{"reaction":"🔥","users":["chengtan9907","liaao","Dracozzz"],"count":3}],"isReport":false}},{"id":"698fd25c3883cdc4e0cf7812","author":{"_id":"63d3e0e8ff1384ce6c5dd17d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg","fullname":"Librarian Bot (Bot)","name":"librarian-bot","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":318,"isUserFollowing":false},"createdAt":"2026-02-14T01:39:40.000Z","type":"comment","data":{"edited":false,"hidden":false,"latest":{"raw":"This is an automated message from the [Librarian Bot](https://huggingface.co/librarian-bots). I found the following papers similar to this paper. \n\nThe following papers were recommended by the Semantic Scholar API \n\n* [VisTIRA: Closing the Image-Text Modality Gap in Visual Math Reasoning via Structured Tool Integration](https://huggingface.co/papers/2601.14440) (2026)\n* [Beyond Accuracy: Evaluating Grounded Visual Evidence in Thinking with Images](https://huggingface.co/papers/2601.11633) (2026)\n* [Unified Thinker: A General Reasoning Modular Core for Image Generation](https://huggingface.co/papers/2601.03127) (2026)\n* [LaViT: Aligning Latent Visual Thoughts for Multi-modal Reasoning](https://huggingface.co/papers/2601.10129) (2026)\n* [TangramPuzzle: Evaluating Multimodal Large Language Models with Compositional Spatial Reasoning](https://huggingface.co/papers/2601.16520) (2026)\n* [CogFlow: Bridging Perception and Reasoning through Knowledge Internalization for Visual Mathematical Problem Solving](https://huggingface.co/papers/2601.01874) (2026)\n* [UReason: Benchmarking the Reasoning Paradox in Unified Multimodal Models](https://huggingface.co/papers/2602.08336) (2026)\n\n\n Please give a thumbs up to this comment if you found it helpful!\n\n If you want recommendations for any Paper on Hugging Face checkout [this](https://huggingface.co/spaces/librarian-bots/recommend_similar_papers) Space\n\n You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: `@librarian-bot recommend`","html":"
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
\n
The following papers were recommended by the Semantic Scholar API
Please give a thumbs up to this comment if you found it helpful!
\n
If you want recommendations for any Paper on Hugging Face checkout this Space
\n
You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: \n\n@librarian-bot\n\t recommend
\n","updatedAt":"2026-02-14T01:39:40.032Z","author":{"_id":"63d3e0e8ff1384ce6c5dd17d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg","fullname":"Librarian Bot (Bot)","name":"librarian-bot","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":318,"isUserFollowing":false}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.7348018288612366},"editors":["librarian-bot"],"editorAvatarUrls":["https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg"],"reactions":[],"isReport":false}}],"primaryEmailConfirmed":false,"paper":{"id":"2602.11731","authors":[{"_id":"698eb9d9cace060ff123aea4","name":"Jingxuan Wei","hidden":false},{"_id":"698eb9d9cace060ff123aea5","name":"Honghao He","hidden":false},{"_id":"698eb9d9cace060ff123aea6","name":"Caijun Jia","hidden":false},{"_id":"698eb9d9cace060ff123aea7","user":{"_id":"640f7083208821a59b74c757","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1678735253848-640f7083208821a59b74c757.jpeg","isPro":false,"fullname":"Siyuan Li","user":"Lupin1998","type":"user"},"name":"Siyuan Li","status":"claimed_verified","statusLastChangedAt":"2026-02-13T09:36:19.311Z","hidden":false},{"_id":"698eb9d9cace060ff123aea8","name":"Zheng Sun","hidden":false},{"_id":"698eb9d9cace060ff123aea9","user":{"_id":"691ac24a4f16b95a50f482bb","avatarUrl":"/avatars/4200eb1c92f67b24b7f29007e9105661.svg","isPro":false,"fullname":"Yuhang Xu","user":"Dracozzz","type":"user"},"name":"Yuhang Xu","status":"claimed_verified","statusLastChangedAt":"2026-02-17T15:51:13.635Z","hidden":false},{"_id":"698eb9d9cace060ff123aeaa","name":"Yuanyuan Lin","hidden":false},{"_id":"698eb9d9cace060ff123aeab","name":"Linzhuang Sun","hidden":false},{"_id":"698eb9d9cace060ff123aeac","name":"Yuchen Wu","hidden":false},{"_id":"698eb9d9cace060ff123aead","name":"Bihui Yu","hidden":false},{"_id":"698eb9d9cace060ff123aeae","name":"Xiangxiang Zhang","hidden":false},{"_id":"698eb9d9cace060ff123aeaf","user":{"_id":"64be296a46cc3cdfbb057f7e","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/64be296a46cc3cdfbb057f7e/jSHeNY2AcPifCZzJyFhr4.jpeg","isPro":false,"fullname":"Cheng Tan","user":"chengtan9907","type":"user"},"name":"Cheng Tan","status":"claimed_verified","statusLastChangedAt":"2026-02-13T09:36:16.992Z","hidden":false}],"publishedAt":"2026-02-12T08:54:02.000Z","submittedOnDailyAt":"2026-02-13T05:37:54.762Z","title":"Thinking with Drafting: Optical Decompression via Logical Reconstruction","submittedOnDailyBy":{"_id":"64be296a46cc3cdfbb057f7e","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/64be296a46cc3cdfbb057f7e/jSHeNY2AcPifCZzJyFhr4.jpeg","isPro":false,"fullname":"Cheng Tan","user":"chengtan9907","type":"user"},"summary":"Existing multimodal large language models have achieved high-fidelity visual perception and exploratory visual generation. However, a precision paradox persists in complex reasoning tasks: optical perception systems transcribe symbols without capturing logical topology, while pixel-based generative models produce visual artifacts lacking mathematical exactness. To bridge this gap, we propose that reasoning over visual inputs be reconceptualized as optical decompression-the process of reconstructing latent logical structures from compressed visual tokens. 
Guided by the axiom that Parsing is Reasoning, we introduce Thinking with Drafting (TwD), which utilizes a minimalist Domain-Specific Language (DSL) as a grounding intermediate representation. Unlike standard approaches that hallucinate answers directly, TwD forces the model to draft its mental model into executable code, rendering deterministic visual proofs for self-verification. To validate this, we present VisAlg, a visual algebra benchmark. Experiments demonstrate that TwD serve as a superior cognitive scaffold. Our work establishes a closed-loop system where visual generation acts not as a creative output but as a logical verifier, offering a generalizable path for visual reasoning.","upvotes":32,"discussionId":"698eb9dacace060ff123aeb0","ai_summary":"Visual reasoning is enhanced by reconstructing logical structures from compressed visual tokens through a DSL-based approach that generates deterministic visual proofs for verification.","ai_keywords":["multimodal large language models","visual perception","visual generation","optical decompression","visual tokens","Domain-Specific Language","visual algebra benchmark","visual reasoning","logical topology","deterministic visual proofs"],"organization":{"_id":"653b817d32c97d0655575872","name":"ByteDance","fullname":"ByteDance","avatar":"https://cdn-uploads.huggingface.co/production/uploads/6535c9e88bde2fae19b6fb25/0clr54wj5Ly-RkYU9OXPp.png"}},"canReadDatabase":false,"canManagePapers":false,"canSubmit":false,"hasHfLevelAccess":false,"upvoted":false,"upvoters":[{"_id":"64be296a46cc3cdfbb057f7e","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/64be296a46cc3cdfbb057f7e/jSHeNY2AcPifCZzJyFhr4.jpeg","isPro":false,"fullname":"Cheng Tan","user":"chengtan9907","type":"user"},{"_id":"640f7083208821a59b74c757","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1678735253848-640f7083208821a59b74c757.jpeg","isPro":false,"fullname":"Siyuan Li","user":"Lupin1998","type":"user"},{"_id":"673d4edb95a3ab68e0b0a722","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/no-auth/VJOHOI4idDLHCPdZBt_do.png","isPro":false,"fullname":"xi","user":"xixii-haha","type":"user"},{"_id":"68ca802ee4b4f3be6800bbfd","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/no-auth/kMcvQTYaCjp22_UzdHgRT.png","isPro":false,"fullname":"Lyo","user":"Bessie311","type":"user"},{"_id":"698ee00813edf0c9b697517b","avatarUrl":"/avatars/a97de0bc736926dbff991e836ffd6cfa.svg","isPro":false,"fullname":"Niangao Pendragon","user":"Niangaoa","type":"user"},{"_id":"698ee331e5c4aa8d9747c4e5","avatarUrl":"/avatars/036bd33f5c7b5cfa623b8fcbce51093a.svg","isPro":false,"fullname":"1","user":"22ll","type":"user"},{"_id":"68d8b91832316f543ea7aa5d","avatarUrl":"/avatars/57d8623058118f8012b517ced4fc5689.svg","isPro":false,"fullname":"xvxinglong","user":"xvxinglong","type":"user"},{"_id":"66e79f11498799255a91239f","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/noauth/HtG-04Xqyd4bZzpKZFjTK.png","isPro":false,"fullname":"Binyu Xie","user":"Alouette-1379","type":"user"},{"_id":"66aa39349238d9c3a1c7f9dc","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/66aa39349238d9c3a1c7f9dc/mj6r7uxEYXM502x296UMf.jpeg","isPro":false,"fullname":"Xin 
Jin","user":"Xin1118","type":"user"},{"_id":"63578f79a1f8ad1c31bd2148","avatarUrl":"/avatars/e91cf0a7c71a9533556267e67bf0610f.svg","isPro":false,"fullname":"Y","user":"CY-7","type":"user"},{"_id":"683f2e9fa073d45457ce420d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/noauth/g2WWieHoqAeG8gb1qWL5J.png","isPro":false,"fullname":"Jason Lee","user":"Jerry-98","type":"user"},{"_id":"649f9edae634fdbf5d2b9991","avatarUrl":"/avatars/2c103aa5952869c72a821b77f718f950.svg","isPro":false,"fullname":"FufengZhou","user":"myturing","type":"user"}],"acceptLanguages":["*"],"dailyPaperRank":0,"organization":{"_id":"653b817d32c97d0655575872","name":"ByteDance","fullname":"ByteDance","avatar":"https://cdn-uploads.huggingface.co/production/uploads/6535c9e88bde2fae19b6fb25/0clr54wj5Ly-RkYU9OXPp.png"}}">
AI-generated summary: Visual reasoning is enhanced by reconstructing logical structures from compressed visual tokens through a DSL-based approach that generates deterministic visual proofs for verification.

Abstract
Existing multimodal large language models have achieved high-fidelity visual perception and exploratory visual generation. However, a precision paradox persists in complex reasoning tasks: optical perception systems transcribe symbols without capturing logical topology, while pixel-based generative models produce visual artifacts lacking mathematical exactness. To bridge this gap, we propose that reasoning over visual inputs be reconceptualized as optical decompression: the process of reconstructing latent logical structures from compressed visual tokens. Guided by the axiom that Parsing is Reasoning, we introduce Thinking with Drafting (TwD), which utilizes a minimalist Domain-Specific Language (DSL) as a grounding intermediate representation. Unlike standard approaches that hallucinate answers directly, TwD forces the model to draft its mental model into executable code, rendering deterministic visual proofs for self-verification. To validate this, we present VisAlg, a visual algebra benchmark. Experiments demonstrate that TwD serves as a superior cognitive scaffold. Our work establishes a closed-loop system where visual generation acts not as a creative output but as a logical verifier, offering a generalizable path for visual reasoning.
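The paper's actual DSL and renderer are not shown on this page, but the closed loop the abstract describes (draft into code, render deterministically, verify) can be sketched. Below is a minimal Python illustration in which the line-oriented grammar, `parse_draft`, and `render` are all invented stand-ins, not the authors' implementation:

```python
# Hypothetical sketch of the TwD closed loop: the model drafts its mental
# model as DSL code, a deterministic renderer turns that draft into a
# "visual proof", and a later comparison step verifies it against the input.
# The DSL below is an invented stand-in, not the paper's actual grammar.

from dataclasses import dataclass

@dataclass
class Segment:
    row: int   # which row of the algebra layout this token sits on
    col: int   # horizontal position (column index)
    text: str  # the symbol occupying that cell, e.g. "x", "+", "3"

def parse_draft(dsl: str) -> list[Segment]:
    """Parse a minimal line-oriented DSL: 'row col text' per line."""
    segments = []
    for line in dsl.strip().splitlines():
        row, col, text = line.split(maxsplit=2)
        segments.append(Segment(int(row), int(col), text))
    return segments

def render(segments: list[Segment]) -> list[str]:
    """Deterministically render the draft as a character grid."""
    n_rows = max(s.row for s in segments) + 1
    n_cols = max(s.col + len(s.text) for s in segments)
    grid = [[" "] * n_cols for _ in range(n_rows)]
    for s in segments:
        for i, ch in enumerate(s.text):
            grid[s.row][s.col + i] = ch
    return ["".join(row) for row in grid]

# A draft of "2x + 3 = 7" over "2x = 4": committing to explicit
# coordinates makes misaligned terms visible in the render.
draft = """
0 0 2x
0 3 +
0 5 3
0 7 =
0 9 7
1 0 2x
1 7 =
1 9 4
"""
for line in render(parse_draft(draft)):
    print(line)
```

The point of the sketch is the determinism: the same draft always renders to the same grid, so the render can stand in as an exact, checkable proof of what the model believes the structure is.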
Comment:

TwD is super refreshing: instead of letting a multimodal model “guess the answer” with fluent CoT or pretty-looking diagrams, it forces the model to draft its reasoning into executable structure. Not vibes. Not plausible pixels. But strict, renderable DSL code.
The “optical decompression” framing is also 🔥 — OCR gives you symbols, but not logical topology. TwD says: real understanding = reconstructing the hidden structure behind those symbols. And the moment the model has to commit to aligned segments, brackets, and cross-row constraints, hallucination becomes much harder.
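To make "commit to aligned segments and cross-row constraints" concrete, here is a toy invariant check. The coordinate scheme and the "all equals signs share a column" rule are invented for illustration, not taken from the paper:

```python
# Hypothetical cross-row constraint: once each symbol is committed to an
# explicit (row, col) coordinate, invariants like "every row's '=' sits
# in the same column" become mechanically checkable, leaving no room for
# a hand-wavy but misaligned derivation.

Draft = list[tuple[int, int, str]]  # (row, col, symbol)

def equals_aligned(draft: Draft) -> bool:
    cols = {col for _, col, sym in draft if sym == "="}
    return len(cols) <= 1  # aligned, or no "=" at all

good = [(0, 7, "="), (1, 7, "=")]
bad = [(0, 7, "="), (1, 6, "=")]
assert equals_aligned(good) and not equals_aligned(bad)
```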
What I like most is the shift from:

generate explanation → hope it's right

to:

generate structure → verify it deterministically
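A minimal sketch of what "verify it deterministically" could look like, assuming a cell-by-cell comparison of character renders rather than whatever verifier the paper actually uses:

```python
# Hypothetical "logical verifier" step: compare the deterministic render
# of the drafted structure against the reference layout, cell by cell.
# Any mismatch localizes exactly where the drafted structure diverges.

def diff_renders(drafted: list[str], reference: list[str]) -> list[tuple[int, int]]:
    """Return (row, col) positions where the two renders disagree."""
    mismatches = []
    for r in range(max(len(drafted), len(reference))):
        a = drafted[r] if r < len(drafted) else ""
        b = reference[r] if r < len(reference) else ""
        for c in range(max(len(a), len(b))):
            if (a[c] if c < len(a) else " ") != (b[c] if c < len(b) else " "):
                mismatches.append((r, c))
    return mismatches

# Empty list -> the draft reproduces the reference exactly; otherwise the
# mismatch coordinates could feed back into the next drafting round.
assert diff_renders(["2x = 4"], ["2x = 4"]) == []
assert diff_renders(["2x = 4"], ["2x = 5"]) == [(0, 5)]
```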
That feels like a big step toward trustworthy multimodal reasoning.