Papers
arxiv:2602.11731

Thinking with Drafting: Optical Decompression via Logical Reconstruction

Published on Feb 12
· Submitted by
Cheng Tan
on Feb 13
Authors:
,
,
,
,
,
,
,
,
,
,

Abstract

Visual reasoning is enhanced by reconstructing logical structures from compressed visual tokens through a DSL-based approach that generates deterministic visual proofs for verification.

AI-generated summary

Existing multimodal large language models have achieved high-fidelity visual perception and exploratory visual generation. However, a precision paradox persists in complex reasoning tasks: optical perception systems transcribe symbols without capturing logical topology, while pixel-based generative models produce visual artifacts lacking mathematical exactness. To bridge this gap, we propose that reasoning over visual inputs be reconceptualized as optical decompression-the process of reconstructing latent logical structures from compressed visual tokens. Guided by the axiom that Parsing is Reasoning, we introduce Thinking with Drafting (TwD), which utilizes a minimalist Domain-Specific Language (DSL) as a grounding intermediate representation. Unlike standard approaches that hallucinate answers directly, TwD forces the model to draft its mental model into executable code, rendering deterministic visual proofs for self-verification. To validate this, we present VisAlg, a visual algebra benchmark. Experiments demonstrate that TwD serve as a superior cognitive scaffold. Our work establishes a closed-loop system where visual generation acts not as a creative output but as a logical verifier, offering a generalizable path for visual reasoning.

Community

Paper author Paper submitter

The core idea of Thinking with Drafting (TwD)

is super refreshing: instead of letting a multimodal model “guess the answer” with fluent CoT or pretty-looking diagrams, it forces the model to draft its reasoning into executable structure. Not vibes. Not plausible pixels. But strict, renderable DSL code.

The “optical decompression” framing is also 🔥 — OCR gives you symbols, but not logical topology. TwD says: real understanding = reconstructing the hidden structure behind those symbols. And the moment the model has to commit to aligned segments, brackets, and cross-row constraints, hallucination becomes much harder.

What I like most is the shift from:

generate explanation → hope it’s right
to
generate structure → verify it deterministically

That feels like a big step toward trustworthy multimodal reasoning.

Paper author

Framework of Thinking with Drafting (TwD) compared to CoT

fig_introduction

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any Paper on Hugging Face checkout this Space

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend

Sign up or log in to comment

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2602.11731 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2602.11731 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2602.11731 in a Space README.md to link it from this page.

Collections including this paper 2