CLEAR: Unlocking Generative Potential for Degraded Image Understanding in Unified Multimodal Models
Abstract
CLEAR is a framework that enhances multimodal model robustness to image degradation by integrating generation and reasoning through supervised fine-tuning, latent representation bridging, and interleaved reinforcement learning.
Image degradation from blur, noise, compression, and poor illumination severely undermines multimodal understanding in real-world settings. Unified multimodal models that combine understanding and generation within a single architecture are a natural fit for this challenge, as their generative pathway can model the fine-grained visual structure that degradation destroys. Yet these models fail to leverage their own generative capacity on degraded inputs. We trace this disconnect to two compounding factors: existing training regimes never ask the model to invoke generation during reasoning, and the standard decode-reencode pathway does not support effective joint optimization. We present CLEAR, a framework that connects the two capabilities through three progressive steps: (1) supervised fine-tuning on a degradation-aware dataset to establish the generate-then-answer reasoning pattern; (2) a Latent Representation Bridge that replaces the decode-reencode detour with a direct, optimizable connection between generation and reasoning; (3) Interleaved GRPO, a reinforcement learning method that jointly optimizes text reasoning and visual generation under answer-correctness rewards. We construct MMD-Bench, covering three degradation severity levels across six standard multimodal benchmarks. Experiments show that CLEAR substantially improves robustness on degraded inputs while preserving clean-image performance. Our analysis further reveals that removing pixel-level reconstruction supervision leads to intermediate visual states with higher perceptual quality, suggesting that task-driven optimization and visual quality are naturally aligned.
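To make the abstract's second step concrete, here is a minimal PyTorch sketch of what a Latent Representation Bridge could look like. Everything in it — the class name, the MLP adapter, and the dimensions gen_dim and und_dim — is our illustrative assumption, not the paper's architecture; the only grounded idea is that generated image latents are projected directly into the understanding model's token space instead of taking the decode-reencode detour through pixels.

```python
import torch
import torch.nn as nn

class LatentRepresentationBridge(nn.Module):
    """Hypothetical sketch: map generation latents straight into the
    understanding model's embedding space, skipping the decode-reencode
    detour so the whole generate-then-answer path stays differentiable."""

    def __init__(self, gen_dim: int = 1024, und_dim: int = 4096):
        super().__init__()
        # A small MLP adapter; the actual bridge design may differ.
        self.proj = nn.Sequential(
            nn.Linear(gen_dim, und_dim),
            nn.GELU(),
            nn.Linear(und_dim, und_dim),
        )
        self.norm = nn.LayerNorm(und_dim)

    def forward(self, gen_latents: torch.Tensor) -> torch.Tensor:
        # gen_latents: (batch, num_tokens, gen_dim) from the generative
        # pathway; the output can be interleaved with text embeddings as
        # the intermediate visual state the model reasons over.
        return self.norm(self.proj(gen_latents))

bridge = LatentRepresentationBridge()
visual_tokens = bridge(torch.randn(2, 256, 1024))
print(visual_tokens.shape)  # torch.Size([2, 256, 4096])
```

Because every operation in the bridge is differentiable, gradients from the answer loss can flow back into the generative pathway — exactly the joint optimization the abstract says the decode-reencode pathway does not support.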
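The third step builds on GRPO, whose group-relative advantage computation is standard and sketched below. How CLEAR interleaves it across text and image-generation segments is our reading of the abstract, and grpo_advantages is a hypothetical helper, not the paper's code.

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Group-relative advantage: each rollout's reward is normalized
    against the group of rollouts sampled for the same prompt."""
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Toy example: 2 degraded-image prompts, 4 rollouts each. A rollout
# earns reward 1.0 if its final answer is correct, else 0.0. Under an
# interleaved scheme, the same scalar advantage would weight both the
# text tokens and the visual latents generated mid-reasoning, so the
# generation step is driven by answer correctness rather than by
# pixel-level reconstruction.
rewards = torch.tensor([[1.0, 0.0, 0.0, 1.0],
                        [1.0, 1.0, 1.0, 0.0]])
print(grpo_advantages(rewards))
```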
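Finally, the abstract names four degradation types (blur, noise, compression, poor illumination) at three severity levels for MMD-Bench. The sketch below shows one plausible way to synthesize such degraded variants with PIL; the severity schedules (blur radii, noise sigmas, JPEG qualities, brightness factors) are illustrative guesses, not MMD-Bench's actual parameters.

```python
import io
import numpy as np
from PIL import Image, ImageEnhance, ImageFilter

def degrade(img: Image.Image, kind: str, level: int) -> Image.Image:
    """Apply one synthetic degradation at severity level 1 (mild),
    2 (moderate), or 3 (severe). Schedules are illustrative only."""
    if kind == "blur":
        return img.filter(ImageFilter.GaussianBlur(radius=2 * level))
    if kind == "noise":
        arr = np.asarray(img).astype(np.float32)
        arr += np.random.normal(0.0, 10.0 * level, arr.shape)
        return Image.fromarray(np.clip(arr, 0, 255).astype(np.uint8))
    if kind == "compression":
        buf = io.BytesIO()  # round-trip through a low-quality JPEG
        img.convert("RGB").save(buf, format="JPEG", quality=40 - 12 * level)
        buf.seek(0)
        return Image.open(buf).convert("RGB")
    if kind == "low_light":
        return ImageEnhance.Brightness(img).enhance(1.0 / (1 + level))
    raise ValueError(f"unknown degradation: {kind}")
```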
Community
Hello everyone! We're happy to share our work: CLEAR: Unlocking Generative Potential for Degraded Image Understanding in Unified Multimodal Models
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- UniG2U-Bench: Do Unified Models Advance Multimodal Understanding? (2026)
- DeepGen 1.0: A Lightweight Unified Multimodal Model for Advancing Image Generation and Editing (2026)
- InternVL-U: Democratizing Unified Multimodal Models for Understanding, Reasoning, Generation and Editing (2026)
- Learning to Generate via Understanding: Understanding-Driven Intrinsic Rewarding for Unified Multimodal Models (2026)
- RobustVisRAG: Causality-Aware Vision-Based Retrieval-Augmented Generation under Visual Degradations (2026)
- Towards Unified Multimodal Interleaved Generation via Group Relative Policy Optimization (2026)
- CrystaL: Spontaneous Emergence of Visual Latents in MLLMs (2026)