HyLaR: Hybrid Latent Reasoning with Decoupled Policy Optimization
Tao Cheng, Shi-Zhe Chen, Hao Zhang, Yixin Qin, Jinwen Luo, Zheng Wei
We introduce HyLaR, a training framework that enables multimodal large language models (MLLMs) to perform hybrid latent reasoning — combining textual chain-of-thought with continuous visual latent representations. HyLaR introduces a Canvas-in-Latents mechanism during supervised fine-tuning and a Decoupled Hybrid PPO algorithm during reinforcement learning, allowing the model to seamlessly interleave discrete text reasoning and continuous latent visual thinking.
🏆 Results
HyLaR achieves consistent improvements across a wide range of multimodal reasoning benchmarks. By combining textual chain-of-thought with continuous latent visual representations, our model demonstrates stronger spatial understanding and more accurate visual grounding. The following figures highlight our quantitative performance gains.
🔍 Overview
⚙ Installation
git clone https://github.com/EthenCheng/HyLar.git
SFT Environment
conda create -n hylar-sft python=3.10
conda activate hylar-sft
cd HyLar/SFT
pip install -r requirements.txt
pip install qwen-vl-utils
pip install flash-attn --no-build-isolation
RL Environment
cd HyLar/RL
conda env create -f environment.yml
🔧 SFT Training
Canvas-in-Latents
The SFT stage teaches the model to reason with Canvas-in-Latents — injecting continuous visual representations into the reasoning trace:
- Special tokens `<|canvas|>`, `<|canvas_start|>`, and `<|canvas_end|>` are added to the tokenizer.
- A frozen vision encoder (SigLIP2) extracts patch features, which are projected to the LLM hidden dimension via a trainable linear projector.
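The projector step can be sketched as follows. This is a minimal illustration, not the repo's implementation: the feature and hidden dimensions (`VISION_DIM`, `LLM_HIDDEN`) and the function name `project_patches` are assumptions chosen for the example.

```python
import numpy as np

# Hypothetical dimensions for illustration only; the real values depend on
# the SigLIP2 variant and the Qwen2.5-VL checkpoint actually used.
VISION_DIM = 1152
LLM_HIDDEN = 3584

CANVAS_TOKENS = ["<|canvas|>", "<|canvas_start|>", "<|canvas_end|>"]

rng = np.random.default_rng(0)

# Trainable linear projector: frozen-encoder patch features -> LLM hidden dim.
W = rng.standard_normal((VISION_DIM, LLM_HIDDEN)) * 0.02
b = np.zeros(LLM_HIDDEN)

def project_patches(patch_features: np.ndarray) -> np.ndarray:
    """Map patch features from the frozen vision encoder into the LLM
    embedding space, where they are injected between the canvas tokens."""
    return patch_features @ W + b

# e.g. 256 patch features from one canvas image
patches = rng.standard_normal((256, VISION_DIM))
canvas_embeds = project_patches(patches)
print(canvas_embeds.shape)  # (256, 3584)
```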
Training Script
See train_canvas.sh.
cd SFT
bash scripts/train_canvas.sh
Implementation Details
The training requires monkey-patching the official Qwen2.5-VL forward pass, implemented in monkey_patch_forward_canvas.py. The patched forward injects canvas hidden states at designated positions and computes the canvas reconstruction loss alongside the standard language modeling loss.
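The combined objective can be sketched as below. Note this is an illustrative sketch only: the MSE form of the reconstruction loss and the `lam` weighting are assumptions, and the actual patched forward lives in monkey_patch_forward_canvas.py.

```python
import numpy as np

def cross_entropy(logits: np.ndarray, targets: np.ndarray) -> float:
    """Standard language-modeling loss over next-token logits (N, vocab)."""
    logits = logits - logits.max(axis=-1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return float(-log_probs[np.arange(len(targets)), targets].mean())

def canvas_recon_loss(pred_latents: np.ndarray, target_latents: np.ndarray) -> float:
    """Reconstruction loss between canvas hidden states injected at the
    designated positions and their vision-encoder targets (MSE assumed here)."""
    return float(((pred_latents - target_latents) ** 2).mean())

def total_loss(logits, targets, pred_latents, target_latents, lam=1.0) -> float:
    # lam balances the two terms; its value here is a placeholder.
    return cross_entropy(logits, targets) + lam * canvas_recon_loss(
        pred_latents, target_latents
    )
```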
🚀 RL Training
Training Script
See depo_train.sh.
cd RL
bash examples/depo_train.sh
Model Merging
After training, merge FSDP sharded checkpoints into a single HuggingFace model:
bash examples/merge_model.sh
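Conceptually, the merge reassembles full tensors from per-rank slices. The sketch below is a simplification under the assumption that each shard holds a contiguous slice of every flattened parameter; the actual merge is handled by merge_model.sh.

```python
import numpy as np

def merge_fsdp_shards(shards: list[dict]) -> dict:
    """Illustrative merge: each shard maps a parameter name to its slice of
    the flattened parameter; concatenating the slices in rank order
    reconstructs the full tensor for a single consolidated checkpoint."""
    merged = {}
    for name in shards[0]:
        merged[name] = np.concatenate([shard[name] for shard in shards])
    return merged

# Two ranks, each holding half of one (toy) embedding parameter.
shard0 = {"model.embed.weight": np.zeros(4)}
shard1 = {"model.embed.weight": np.ones(4)}
full = merge_fsdp_shards([shard0, shard1])
print(full["model.embed.weight"].shape)  # (8,)
```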
⭐ Inference & Evaluation
Multi-GPU Evaluation
HyLaR supports multi-GPU parallel inference with configurable latent reasoning depth.
See run_eval_HyLar.sh:
bash Evaluate/run_eval_HyLar.sh
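A common way to parallelize evaluation is to stride the eval set across ranks; the sketch below shows the idea, assuming a rank-strided split (the script's actual partitioning scheme may differ).

```python
def shard_for_rank(samples: list, rank: int, world_size: int) -> list:
    """Round-robin split of the eval set: rank r processes samples
    r, r + world_size, r + 2 * world_size, ..."""
    return samples[rank::world_size]

# 10 eval samples split across 4 GPUs
samples = list(range(10))
print(shard_for_rank(samples, 0, 4))  # [0, 4, 8]
print(shard_for_rank(samples, 3, 4))  # [3, 7]
```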