HyLaR: Hybrid Latent Reasoning with Decoupled Policy Optimization

Tao Cheng, Shi-Zhe Chen, Hao Zhang, Yixin Qin, Jinwen Luo, Zheng Wei

Paper PDF · HF Model: HyLaR-7B · HF Data: HyLaR-SFT-Data · HF Data: HyLaR-RL-Data

Overview

We introduce HyLaR, a training framework that enables multimodal large language models (MLLMs) to perform hybrid latent reasoning: combining textual chain-of-thought with continuous visual latent representations. HyLaR pairs a Canvas-in-Latents mechanism during supervised fine-tuning with a Decoupled Hybrid PPO algorithm during reinforcement learning, allowing the model to seamlessly interleave discrete text reasoning and continuous latent visual thinking.
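To make the interleaving concrete, here is a minimal sketch of a hybrid decoding loop. This is illustrative only, not the released implementation: greedy decoding, the trigger logic, and the projector wiring are all assumptions.

import torch

def hybrid_generate(model, tokenizer, projector, vision_feats, input_ids, max_steps=256):
    """Sketch: decode discrete text tokens; when the model emits
    <|canvas_start|>, splice continuous visual latents into the input
    embeddings instead of discrete token embeddings."""
    canvas_start = tokenizer.convert_tokens_to_ids("<|canvas_start|>")
    canvas_end = tokenizer.convert_tokens_to_ids("<|canvas_end|>")
    embed = model.get_input_embeddings()
    inputs_embeds = embed(input_ids)

    for _ in range(max_steps):
        logits = model(inputs_embeds=inputs_embeds).logits[:, -1]
        next_id = logits.argmax(dim=-1)                      # greedy decoding (assumption)
        inputs_embeds = torch.cat([inputs_embeds, embed(next_id).unsqueeze(1)], dim=1)
        if next_id.item() == canvas_start:
            # Continuous latent visual thinking: inject projected patch
            # features directly as soft tokens, then close the canvas.
            latents = projector(vision_feats)                # (1, n_patches, hidden)
            end_ids = torch.tensor([[canvas_end]], device=inputs_embeds.device)
            inputs_embeds = torch.cat([inputs_embeds, latents, embed(end_ids)], dim=1)
        elif next_id.item() == tokenizer.eos_token_id:
            break
    return inputs_embeds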

🏆 Results

HyLaR achieves consistent improvements across a wide range of multimodal reasoning benchmarks. By combining textual chain-of-thought with continuous latent visual representations, our model demonstrates stronger spatial understanding and more accurate visual grounding. The following figures highlight our quantitative performance gains.

Results on High-Resolution Image Perception and Visual Search Benchmarks. The best-performing latent reasoning model is highlighted in bold. (* Reproduced via our evaluation pipeline for fair comparison; original reported scores are in gray.)

Results on Multimodal VQA, Reasoning and Hallucination Benchmarks. The best-performing latent reasoning model for each dataset is highlighted in bold.

🔍 Table of Contents
  1. Installation
  2. SFT Training
  3. RL Training
  4. Inference & Evaluation

⚙ Installation

git clone https://github.com/EthenCheng/HyLar.git

SFT Environment

conda create -n hylar-sft python=3.10
conda activate hylar-sft
cd HyLar/SFT
pip install -r requirements.txt
pip install qwen-vl-utils
pip install flash-attn --no-build-isolation

RL Environment

cd HyLar/RL
conda env create -f environment.yml

🔧 SFT Training

Canvas-in-Latents

The SFT stage teaches the model to reason with Canvas-in-Latents, injecting continuous visual representations into the reasoning trace:

  • Special tokens <|canvas|>, <|canvas_start|>, <|canvas_end|> are added to the tokenizer.
  • A frozen vision encoder (SigLIP2) extracts patch features, which are projected to the LLM hidden dimension via a trainable linear projector.
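For reference, a minimal sketch of these two steps. The SigLIP2 checkpoint name and the 3584 hidden size for Qwen2.5-VL-7B are assumptions, and the repo's actual module names may differ.

import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

# 1) Register the canvas special tokens (the embedding table must then be
#    resized after the MLLM is loaded).
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")
tokenizer.add_special_tokens(
    {"additional_special_tokens": ["<|canvas|>", "<|canvas_start|>", "<|canvas_end|>"]}
)
# model.resize_token_embeddings(len(tokenizer))  # after loading the MLLM

# 2) Frozen vision encoder + trainable linear projector.
vision_encoder = AutoModel.from_pretrained("google/siglip2-so400m-patch14-384")  # checkpoint name is an assumption
vision_encoder.requires_grad_(False)  # frozen

projector = nn.Linear(
    vision_encoder.config.vision_config.hidden_size,  # SigLIP2 patch feature dim
    3584,                                             # Qwen2.5-VL-7B hidden size
)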

Training Script

See train_canvas.sh.

cd SFT
bash scripts/train_canvas.sh

Implementation Details

The training requires monkey-patching the official Qwen2.5-VL forward pass, implemented in monkey_patch_forward_canvas.py. The patched forward injects canvas hidden states at designated positions and computes the canvas reconstruction loss alongside the standard language modeling loss.
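As a hedged sketch of what such a patch looks like (the real monkey_patch_forward_canvas.py will differ; the MSE reconstruction loss and the mask-based position lookup are assumptions):

import torch.nn.functional as F
from transformers.models.qwen2_5_vl import modeling_qwen2_5_vl as qwen_vl

_orig_forward = qwen_vl.Qwen2_5_VLForConditionalGeneration.forward

def forward_with_canvas(self, input_ids=None, canvas_latents=None,
                        canvas_token_id=None, **kwargs):
    """Inject canvas hidden states at <|canvas|> positions and add a
    reconstruction loss on top of the standard LM loss (sketch)."""
    inputs_embeds = self.get_input_embeddings()(input_ids)
    canvas_mask = input_ids == canvas_token_id
    # Overwrite placeholder embeddings with projected visual latents.
    inputs_embeds[canvas_mask] = canvas_latents.to(inputs_embeds.dtype)

    out = _orig_forward(self, inputs_embeds=inputs_embeds,
                        output_hidden_states=True, **kwargs)
    # Canvas reconstruction loss: hidden states at canvas positions should
    # reproduce the injected latents (MSE is an assumption here).
    if out.loss is not None:
        recon = F.mse_loss(out.hidden_states[-1][canvas_mask], canvas_latents)
        out.loss = out.loss + recon
    return out

qwen_vl.Qwen2_5_VLForConditionalGeneration.forward = forward_with_canvas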

🚀 RL Training

Training Script

See depo_train.sh.

cd RL
bash examples/depo_train.sh

Model Merging

After training, merge FSDP sharded checkpoints into a single HuggingFace model:

bash examples/merge_model.sh
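Conceptually, assuming the shards are stored in PyTorch's distributed-checkpoint (DCP) format, the merge amounts to something like the sketch below. Paths and the exact checkpoint layout are placeholders; see merge_model.sh for the actual procedure.

import torch
from torch.distributed.checkpoint.format_utils import dcp_to_torch_save
from transformers import Qwen2_5_VLForConditionalGeneration

# 1) Consolidate the DCP shards into a single torch.save file.
dcp_to_torch_save("checkpoints/global_step_500", "merged/model.pt")  # placeholder paths

# 2) Load the consolidated weights into the HF model and export.
state_dict = torch.load("merged/model.pt", map_location="cpu")
model = Qwen2_5_VLForConditionalGeneration.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")
model.load_state_dict(state_dict, strict=False)  # key names may be wrapped, hence strict=False
model.save_pretrained("merged/HyLaR-7B-hf")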

⭐ Inference & Evaluation

Multi-GPU Evaluation

HyLaR supports multi-GPU parallel inference with configurable latent reasoning depth.

See run_eval_HyLar.sh:

bash Evaluate/run_eval_HyLar.sh
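For intuition, a launcher of this kind can shard the benchmark across GPUs roughly as follows. This is a sketch only: eval_hylar.py and the --latent-depth flag are hypothetical names, and the actual entry point is run_eval_HyLar.sh.

import os
import subprocess

# One worker per GPU, each handling a disjoint shard of the eval set.
NUM_GPUS = 8
procs = []
for rank in range(NUM_GPUS):
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(rank))
    procs.append(subprocess.Popen(
        ["python", "Evaluate/eval_hylar.py",            # hypothetical entry point
         "--shard-id", str(rank), "--num-shards", str(NUM_GPUS),
         "--latent-depth", "4"],                        # hypothetical flag: latent reasoning depth
        env=env,
    ))
for p in procs:
    p.wait()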