arxiv:2603.10757

CodePercept: Code-Grounded Visual STEM Perception for MLLMs

Published on Mar 11 · Submitted by TongkunGuan on Mar 12

Abstract

MLLMs struggle with STEM visual reasoning because of perceptual limitations rather than reasoning deficiencies; enhancing perception through a code-as-perception paradigm improves performance.

AI-generated summary

When MLLMs fail at Science, Technology, Engineering, and Mathematics (STEM) visual reasoning, a fundamental question arises: is it due to perceptual deficiencies or reasoning limitations? Through systematic scaling analysis that independently scales perception and reasoning components, we uncover a critical insight: scaling perception consistently outperforms scaling reasoning. This reveals perception as the true lever limiting current STEM visual reasoning. Motivated by this insight, our work focuses on systematically enhancing the perception capabilities of MLLMs by establishing code as a powerful perceptual medium--executable code provides precise semantics that naturally align with the structured nature of STEM visuals. Specifically, we construct ICC-1M, a large-scale dataset comprising 1M Image-Caption-Code triplets that materializes this code-as-perception paradigm through two complementary approaches: (1) Code-Grounded Caption Generation treats executable code as ground truth for image captions, eliminating the hallucinations inherent in existing knowledge distillation methods; (2) STEM Image-to-Code Translation prompts models to generate reconstruction code, mitigating the ambiguity of natural language for perception enhancement. To validate this paradigm, we further introduce STEM2Code-Eval, a novel benchmark that directly evaluates visual perception in STEM domains. Unlike existing work relying on problem-solving accuracy as a proxy that only measures problem-relevant understanding, our benchmark requires comprehensive visual comprehension through executable code generation for image reconstruction, providing deterministic and verifiable assessment. Code is available at https://github.com/TongkunGuan/Qwen-CodePercept.

Community

Comment by the paper author and submitter


  1. How This Work Fits into the Broader Research Landscape

The field of Multimodal Large Language Models (MLLMs) has seen significant advancements, particularly in adapting Large Language Model (LLM) successes to multimodal domains. A prominent area of application is Science, Technology, Engineering, and Mathematics (STEM), where MLLMs are increasingly employed for complex problem-solving that requires both visual understanding and sophisticated reasoning. Prior research in this area has largely concentrated on enhancing the reasoning capabilities of MLLMs. Approaches include cold-start thinking data curation, reinforcement learning (RL) based methods with designed reward mechanisms, and unimodal thinking data transfer to improve reasoning performance.

Despite these efforts, the current work identifies a critical unaddressed question: whether MLLM failures in STEM visual reasoning are primarily attributable to perceptual deficiencies or reasoning limitations. Existing MLLM research has predominantly focused on reasoning, potentially overlooking the fundamental role of visual perception. Furthermore, methods for evaluating visual perception in STEM contexts often rely on problem-solving accuracy as an indirect proxy, which measures problem-relevant information extraction rather than a comprehensive visual understanding. This approach may miss critical visual details not directly tied to specific questions but essential for complete perception.

This paper positions itself to bridge these identified gaps by shifting the focus from reasoning to perception. It proposes that perception is a primary bottleneck in MLLM performance for STEM tasks and introduces a novel paradigm to systematically enhance and rigorously evaluate visual perception using executable code as a structured, verifiable medium. This addresses the limitations of natural language in capturing precise spatial and quantitative relationships in complex STEM visuals and offers a deterministic method for perception assessment.

  2. Key Objectives and Motivation

The primary motivation for this research stems from an empirical observation: a systematic scaling analysis revealed that enhancing the perception component of MLLMs consistently yields greater performance improvements in STEM visual reasoning tasks compared to scaling the reasoning component. This finding indicates that perception is the current limiting factor for MLLMs in STEM domains.

Building on this insight, the key objectives of the CodePercept project are detailed as follows:

Systematically Enhance Perception Capabilities: The central objective is to develop methodologies that explicitly improve the visual perception abilities of MLLMs, particularly for STEM imagery. This is a direct response to the identified bottleneck.

Establish Code as a Perceptual Medium: The work aims to demonstrate that executable code serves as a powerful and precise medium for representing and grounding visual perception in STEM. Code’s structured nature and unambiguous semantics are posited to align well with the precise requirements of STEM visuals, overcoming the inherent ambiguities and "descriptive aphasia" of natural language for complex spatial, quantitative, and relational information.
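To make the contrast concrete, here is a minimal, hypothetical sketch (not from the paper) of why executable code is a less ambiguous perceptual medium than a caption: a description like "a right triangle with legs of roughly equal length" leaves the geometry underspecified, while code pins down every coordinate and makes the quantitative relations checkable.

```python
import math
from dataclasses import dataclass

# Hypothetical illustration: a STEM figure expressed as executable code.
# Every coordinate is exact, so spatial and quantitative claims about the
# figure can be verified rather than merely described.

@dataclass(frozen=True)
class Triangle:
    a: tuple  # vertex coordinates
    b: tuple
    c: tuple

    def side(self, p, q):
        return math.dist(p, q)

    def is_right(self, tol=1e-9):
        # Check the Pythagorean relation on the sorted squared side lengths.
        s = sorted(self.side(*pair) ** 2 for pair in
                   [(self.a, self.b), (self.b, self.c), (self.a, self.c)])
        return abs(s[0] + s[1] - s[2]) < tol

fig = Triangle(a=(0, 0), b=(3, 0), c=(0, 4))
assert fig.is_right()                 # the relation is verifiable, not described
assert fig.side(fig.b, fig.c) == 5.0  # hypotenuse length follows deterministically
```

A natural-language caption can only approximate these facts; the code both renders the figure and certifies its properties.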

Construct a Large-Scale, Code-Grounded Dataset: A critical objective is the creation of ICC-1M, a dataset comprising over 1 million Image-Caption-Code triplets. This dataset is designed to materialize the "code-as-perception" paradigm by providing high-quality, code-grounded training data. The dataset construction aims to mitigate issues like hallucinations prevalent in existing knowledge distillation methods that rely on natural language descriptions from advanced MLLMs.

Develop Code-Grounded Training Tasks: To leverage the ICC-1M dataset, the project aims to introduce two complementary training tasks:

Code-Grounded Caption Generation: This task aims to produce factually accurate and linguistically natural image captions by using executable code as the ground truth, thereby eliminating errors and hallucinations common in direct vision-to-text generation.

STEM Image-to-Code Translation: This task directly trains models to generate executable code for image reconstruction, which serves to enhance the model’s understanding of visual elements in a structured, verifiable manner.

Introduce a Novel, Verifiable Perception Benchmark: The research aims to establish STEM2Code-Eval, a new benchmark specifically designed to directly and deterministically evaluate visual perception in STEM domains. Unlike existing benchmarks that rely on problem-solving accuracy as an indirect proxy, STEM2Code-Eval requires models to generate executable code for image reconstruction, providing a direct and verifiable assessment of comprehensive visual comprehension.

In summary, the motivation is to address a fundamental limitation in MLLMs for STEM by targeting perception as the core issue, and the objectives are to systematically tackle this through code-grounded data generation, training, and evaluation.
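The Image-Caption-Code triplets and the code-grounded captioning task described above can be sketched roughly as follows. This is a schematic under assumed field names (the paper does not publish ICC-1M's exact schema); the point is that the caption is derived from the code's parameters rather than from a vision model's guess, so it cannot hallucinate values absent from the figure's source.

```python
# Sketch of an Image-Caption-Code triplet with a code-grounded caption.
# Field names and the templating function are hypothetical illustrations.

def caption_from_code(params):
    # The caption is templated from parameters parsed out of the figure's
    # source code, so every stated value is grounded in the code.
    return (f"A {params['kind']} plot of y = {params['expr']} "
            f"over x in [{params['xmin']}, {params['xmax']}].")

triplet = {
    "image": "fig_000042.png",                    # rendered by executing the code
    "code": "plt.plot(x, x**2)  # x in [-2, 2]",  # ground-truth figure source
    "caption": caption_from_code(
        {"kind": "line", "expr": "x**2", "xmin": -2, "xmax": 2}),
}
assert "x**2" in triplet["caption"]
```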

  3. Main Findings and Results

The research presents several key findings and results supporting the CodePercept paradigm:

3.1. Perception as the Primary Bottleneck

A systematic scaling analysis was conducted by decoupling STEM visual reasoning into a perception stage (image-to-caption) and a reasoning stage (caption-to-answer).

Results consistently demonstrated that scaling perception capabilities (e.g., from 4B to 32B parameters for the perception component, while keeping reasoning fixed) yielded greater performance gains in STEM problem-solving compared to scaling reasoning capabilities (e.g., scaling the reasoning component while keeping perception fixed). This empirical evidence supports the hypothesis that perception is the current limiting factor in STEM visual reasoning for MLLMs.
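The decoupling above can be sketched as two swappable stages; the stand-in functions below are toy illustrations (not the paper's models) of why scaling perception helps where scaling reasoning cannot: a detail lost during captioning is unrecoverable by any downstream solver.

```python
# Sketch of the decoupled scaling analysis, assuming two swappable stages.
# `perceive` and `reason` stand in for an MLLM captioner and an LLM solver.

def solve(image, perceive, reason):
    caption = perceive(image)  # perception stage: image -> caption
    return reason(caption)     # reasoning stage: caption -> answer

# Toy stand-ins: a "small" captioner drops a detail a "large" one keeps.
small_captioner = lambda img: "a triangle"
large_captioner = lambda img: "a right triangle with legs 3 and 4"
solver = lambda cap: "5" if "3 and 4" in cap else "unknown"

# Scaling perception fixes the answer; scaling reasoning alone cannot,
# because the missing detail was lost before reasoning began.
assert solve(None, small_captioner, solver) == "unknown"
assert solve(None, large_captioner, solver) == "5"
```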

3.2. Problem-solving Perception Evaluation

CodePercept models were evaluated as image captioners using a captioner-solver setup across six STEM reasoning benchmarks (MathVision, MathVista, MathVerse, DynaMath, WeMath, LogicVista). The generated captions were then fed to a fixed LLM solver (Qwen3-30A3-Thinking or Qwen3-235A22-Thinking) to generate answers.

Consistent Improvements: CodePercept-S1 models demonstrated consistent and substantial improvements over baseline MLLMs. For example, CodePercept-4B-S1 outperformed Qwen3-VL-4B-Instruct by 2.8% on average, while CodePercept-8B-S1 showed a 3.0% gain using the Qwen3-30A3-Thinking solver.

Robustness Across Solvers: Similar gains (2.9% and 3.4% for 4B and 8B, respectively) were observed with the more powerful Qwen3-235A22-Thinking solver, indicating robustness across different reasoning capabilities.

Competitive Performance: CodePercept-8B-S1 surpassed several larger models, including Qwen2.5-VL-72B (by 6.2%), and approached the performance of frontier models such as Claude-Opus 4.1-Thinking and GPT5-Thinking. These results validate that code-grounded perception enhances STEM visual perception capabilities that translate to improved downstream reasoning.

3.3. Image Reconstruction Perception Evaluation (STEM2Code-Eval)

The STEM2Code-Eval benchmark, which directly assesses visual perception by requiring models to generate executable code for image reconstruction, was used to evaluate CodePercept models. Performance was measured by Image Score, Code Score, and Execution Rate.

Significant S1 Gains: CodePercept-S1 models, trained on joint code-grounded tasks, achieved substantial improvements. CodePercept-4B-S1 reached an average score of 54.09, an increase of 10.6 points over Qwen3-VL-4B-Instruct. CodePercept-8B-S1 achieved 59.64, an increase of 12.3 points over its baseline. These results indicate that grounding perception in executable code fundamentally enhances visual comprehension.

Further RL Optimization: The CodePercept-R1 models, optimized with reinforcement learning, showed further gains. CodePercept-4B-R1 achieved 61.44 (+7.35 over S1), and CodePercept-8B-R1 reached 63.56 (+3.92 over S1).

Superiority over Large Models: These RL-optimized models surpassed several super-large MLLMs, including Seed1.6-Vision and Qwen3-VL-Plus, in direct visual perception capabilities.
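Of the three metrics above, the Execution Rate is the most mechanical to compute; a minimal sketch follows, assuming a simple try/except harness over generated snippets (STEM2Code-Eval's actual sandboxing and the Image/Code scoring are not specified here).

```python
# Sketch of an execution-rate metric: run each generated snippet in an
# isolated namespace and count the fraction that executes without error.
# This is an assumed harness, not STEM2Code-Eval's actual implementation.

def execution_rate(snippets):
    ok = 0
    for src in snippets:
        try:
            exec(compile(src, "<generated>", "exec"), {})  # fresh namespace
            ok += 1
        except Exception:
            pass  # syntax errors, NameErrors, etc. count as failures
    return ok / len(snippets)

rate = execution_rate([
    "x = 1 + 1",                    # executes
    "y = undefined_name",           # NameError
    "z = [i for i in range(3)]",    # executes
])
assert abs(rate - 2 / 3) < 1e-9
```

In practice such a harness would also need a timeout and sandbox, since model-generated code can loop or touch the filesystem.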
