SSync: Selective Synergistic Learning for Video Object-Centric Learning

ECCV 2026 · Paper · Code · Project Page

Authors: WonJun Moon (KAIST), Jae-Pil Heo (Sungkyunkwan University)

Model Description

SSync is a selective mutual-distillation framework for video object-centric learning (VOCL). Slot-based VOCL methods are guided by two spatial maps: the encoder's attention map (sharp boundaries, noisy interiors) and the decoder's object map (coherent interiors, blurry boundaries). Rather than forcing dense agreement across all spatio-temporal patches, SSync distills only the most reliable cues from each map into the other:

Encoder → Decoder: boundary refinement via crisp attention boundaries
Decoder → Encoder: interior denoising via coherent object maps

Both directions are realized through a pseudo-labeling scheme whose cost is linear in the number of patches, which avoids quadratic patch-to-patch comparisons. A transitive pseudo-label merging step then consolidates redundant slots based on the consistency of their spatio-temporal activations, making SSync robust to the configured number of slots.
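To make the two ideas in the paragraph above concrete, here is a minimal sketch of per-patch pseudo-labeling followed by transitive slot merging. It is an illustration under stated assumptions, not the released SSync code: the function names (`hard_labels`, `merge_slots`), the cosine-similarity criterion, and the `threshold` value are all hypothetical stand-ins for whatever consistency measure the paper actually uses.

```python
import math

def hard_labels(soft_map):
    """soft_map[p][k] is the weight of slot k at spatio-temporal patch p.
    One argmax per patch: O(P * K) work, with no patch-to-patch comparison."""
    return [max(range(len(row)), key=row.__getitem__) for row in soft_map]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def merge_slots(soft_map, threshold=0.9):
    """Return, for every slot, the representative slot it is merged into.
    ASSUMPTION: two slots count as redundant when their activation patterns
    over all patches have cosine similarity >= threshold."""
    num_slots = len(soft_map[0])
    columns = [[row[k] for row in soft_map] for k in range(num_slots)]
    parent = list(range(num_slots))  # union-find forest over slots

    def find(k):
        while parent[k] != k:
            parent[k] = parent[parent[k]]  # path halving
            k = parent[k]
        return k

    # Union-find makes the merge transitive: A~B and B~C puts A, B, C together.
    for i in range(num_slots):
        for j in range(i + 1, num_slots):
            if cosine(columns[i], columns[j]) >= threshold:
                ri, rj = find(i), find(j)
                parent[max(ri, rj)] = min(ri, rj)
    return [find(k) for k in range(num_slots)]

# Toy input: 4 patches (e.g. 2 frames x 2 patches), 3 slots.
# Slots 0 and 1 fire on the same patches, so they are redundant.
soft_map = [
    [0.50, 0.40, 0.10],
    [0.45, 0.50, 0.05],
    [0.05, 0.05, 0.90],
    [0.10, 0.10, 0.80],
]
labels = hard_labels(soft_map)          # [0, 1, 2, 2]
mapping = merge_slots(soft_map)         # [0, 0, 2]
merged = [mapping[k] for k in labels]   # [0, 0, 2, 2]
```

The pairwise loop runs over slots, not patches, so it stays cheap when the slot count is small; the per-patch work remains a single argmax.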

Evaluation Results

Object discovery on VOCL benchmarks (averaged over 3 runs):

Dataset	FG-ARI ↑	mBO ↑
MOVi-C (336×336)	79.4	39.5
MOVi-E (336×336)	84.0	34.8
YouTube-VIS 2021 (518×518)	42.6	38.7
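For readers unfamiliar with the two metrics, the sketch below computes them on flat label lists using their standard definitions: FG-ARI is the adjusted Rand index restricted to ground-truth foreground pixels, and mBO averages, over ground-truth objects, the best IoU achieved by any predicted mask. The function names and the convention that label `0` marks background are assumptions for illustration; the authors' evaluation code may aggregate differently (for example per video rather than per frame). The table values appear to be these scores scaled by 100.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(true, pred):
    """Standard ARI from the contingency table of two labelings."""
    n = len(true)
    if n < 2:
        return 1.0
    sum_ij = sum(comb(c, 2) for c in Counter(zip(true, pred)).values())
    sum_a = sum(comb(c, 2) for c in Counter(true).values())
    sum_b = sum(comb(c, 2) for c in Counter(pred).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:  # degenerate case, e.g. a single cluster
        return 1.0
    return (sum_ij - expected) / (max_index - expected)

def fg_ari(true, pred, background=0):
    """ARI over pixels whose ground-truth label is not background."""
    keep = [i for i, t in enumerate(true) if t != background]
    return adjusted_rand_index([true[i] for i in keep], [pred[i] for i in keep])

def mean_best_overlap(true, pred, background=0):
    """For each ground-truth object, take the best IoU over predicted masks."""
    pred_masks = {}
    for i, p in enumerate(pred):
        pred_masks.setdefault(p, set()).add(i)
    scores = []
    for obj in sorted(set(true) - {background}):
        gt = {i for i, t in enumerate(true) if t == obj}
        scores.append(max(len(gt & m) / len(gt | m) for m in pred_masks.values()))
    return sum(scores) / len(scores) if scores else 0.0

# Toy frame flattened to 8 pixels: two background pixels, two objects.
true = [0, 0, 1, 1, 1, 2, 2, 2]
pred = [5, 5, 3, 3, 4, 4, 4, 4]
score_ari = fg_ari(true, pred)
score_mbo = mean_best_overlap(true, pred)
```

Because FG-ARI ignores background pixels entirely, a model can score well on it with loose masks, which is why mBO is usually reported alongside it.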

Training Data

Dataset	Size
YouTube-VIS 2021	26.43 GB
MOVi-C	7.43 GB
MOVi-E	8.26 GB

See data/README.md for download instructions.

Citation

@inproceedings{moon2026ssync,
  title     = {Selective Synergistic Learning for Video Object-Centric Learning},
  author    = {Moon, WonJun and Heo, Jae-Pil},
  booktitle = {European Conference on Computer Vision (ECCV)},
  year      = {2026}
}

Acknowledgements

Built upon VideoSAUR, SlotContrast, SRL, and SlotCurri.

Paper: Selective Synergistic Learning for Video Object-Centric Learning (2606.15527)