HyLaR: Hybrid Latent Reasoning with Decoupled Policy Optimization
Tao Cheng, Shi-Zhe Chen, Hao Zhang, Yixin Qin, Jinwen Luo, Zheng Wei
We introduce HyLaR, a training framework that enables multimodal large language models (MLLMs) to perform hybrid latent reasoning — combining textual chain-of-thought with continuous visual latent representations. HyLaR introduces a Canvas-in-Latents mechanism during supervised fine-tuning and a Decoupled Hybrid PPO algorithm during reinforcement learning, allowing the model to seamlessly interleave discrete text reasoning and continuous latent visual thinking.
🏆 Results
HyLaR achieves consistent improvements across a wide range of multimodal reasoning benchmarks. By combining textual chain-of-thought with continuous latent visual representations, our model demonstrates stronger spatial understanding and more accurate visual grounding. The following figures highlight our quantitative performance gains.
🔍 Overview
⚙ Installation
git clone https://github.com/EthenCheng/HyLar.git
SFT Environment
conda create -n hylar-sft python=3.10
conda activate hylar-sft
cd HyLar/SFT
pip install -r requirements.txt
pip install qwen-vl-utils
pip install flash-attn --no-build-isolation
RL Environment
cd HyLar/RL
conda env create -f environment.yml
🔧 SFT Training
Canvas-in-Latents
The SFT stage teaches the model to reason with Canvas-in-Latents — injecting continuous visual representations into the reasoning trace:
- Special tokens `<|canvas|>`, `<|canvas_start|>`, and `<|canvas_end|>` are added to the tokenizer.
- A frozen vision encoder (SigLIP2) extracts patch features, which are projected to the LLM hidden dimension via a trainable linear projector.
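The projector step can be sketched as follows. This is a minimal illustration, not the repo's implementation: the feature and hidden dimensions (`VISION_DIM`, `LLM_HIDDEN`) and the function name `project_patches` are assumptions chosen for the example.

```python
import numpy as np

# Hypothetical dimensions for illustration only; the real values depend on
# the SigLIP2 variant and the Qwen2.5-VL checkpoint actually used.
VISION_DIM = 1152
LLM_HIDDEN = 3584

CANVAS_TOKENS = ["<|canvas|>", "<|canvas_start|>", "<|canvas_end|>"]

rng = np.random.default_rng(0)

# Trainable linear projector: frozen-encoder patch features -> LLM hidden dim.
W = rng.standard_normal((VISION_DIM, LLM_HIDDEN)) * 0.02
b = np.zeros(LLM_HIDDEN)

def project_patches(patch_features: np.ndarray) -> np.ndarray:
    """Map patch features from the frozen vision encoder into the LLM
    embedding space, where they are injected between the canvas tokens."""
    return patch_features @ W + b

# e.g. 256 patch features from one canvas image
patches = rng.standard_normal((256, VISION_DIM))
canvas_embeds = project_patches(patches)
print(canvas_embeds.shape)  # (256, 3584)
```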
Training Script
See train_canvas.sh.
cd SFT
bash scripts/train_canvas.sh
Implementation Details
The training requires monkey-patching the official Qwen2.5-VL forward pass, implemented in monkey_patch_forward_canvas.py. The patched forward injects canvas hidden states at designated positions and computes the canvas reconstruction loss alongside the standard language modeling loss.
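The combined objective can be sketched as below. Note this is an illustrative sketch only: the MSE form of the reconstruction loss and the `lam` weighting are assumptions, and the actual patched forward lives in monkey_patch_forward_canvas.py.

```python
import numpy as np

def cross_entropy(logits: np.ndarray, targets: np.ndarray) -> float:
    """Standard language-modeling loss over next-token logits (N, vocab)."""
    logits = logits - logits.max(axis=-1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return float(-log_probs[np.arange(len(targets)), targets].mean())

def canvas_recon_loss(pred_latents: np.ndarray, target_latents: np.ndarray) -> float:
    """Reconstruction loss between canvas hidden states injected at the
    designated positions and their vision-encoder targets (MSE assumed here)."""
    return float(((pred_latents - target_latents) ** 2).mean())

def total_loss(logits, targets, pred_latents, target_latents, lam=1.0) -> float:
    # lam balances the two terms; its value here is a placeholder.
    return cross_entropy(logits, targets) + lam * canvas_recon_loss(
        pred_latents, target_latents
    )
```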
🚀 RL Training
Training Script
See depo_train.sh.
cd RL
bash examples/depo_train.sh
Model Merging
After training, merge FSDP sharded checkpoints into a single HuggingFace model:
bash examples/merge_model.sh
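Conceptually, the merge reassembles full tensors from per-rank slices. The sketch below is a simplification under the assumption that each shard holds a contiguous slice of every flattened parameter; the actual merge is handled by merge_model.sh.

```python
import numpy as np

def merge_fsdp_shards(shards: list[dict]) -> dict:
    """Illustrative merge: each shard maps a parameter name to its slice of
    the flattened parameter; concatenating the slices in rank order
    reconstructs the full tensor for a single consolidated checkpoint."""
    merged = {}
    for name in shards[0]:
        merged[name] = np.concatenate([shard[name] for shard in shards])
    return merged

# Two ranks, each holding half of one (toy) embedding parameter.
shard0 = {"model.embed.weight": np.zeros(4)}
shard1 = {"model.embed.weight": np.ones(4)}
full = merge_fsdp_shards([shard0, shard1])
print(full["model.embed.weight"].shape)  # (8,)
```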
⭐ Inference & Evaluation
Multi-GPU Evaluation
HyLaR supports multi-GPU parallel inference with configurable latent reasoning depth.
See run_eval_HyLar.sh:
bash Evaluate/run_eval_HyLar.sh
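A common way to parallelize evaluation is to stride the eval set across ranks; the sketch below shows the idea, assuming a rank-strided split (the script's actual partitioning scheme may differ).

```python
def shard_for_rank(samples: list, rank: int, world_size: int) -> list:
    """Round-robin split of the eval set: rank r processes samples
    r, r + world_size, r + 2 * world_size, ..."""
    return samples[rank::world_size]

# 10 eval samples split across 4 GPUs
samples = list(range(10))
print(shard_for_rank(samples, 0, 4))  # [0, 4, 8]
print(shard_for_rank(samples, 3, 4))  # [3, 7]
```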