SSync: Selective Synergistic Learning for Video Object-Centric Learning
ECCV 2026 · Paper · Code · Project Page
Authors: WonJun Moon (KAIST), Jae-Pil Heo (Sungkyunkwan University)
Model Description
SSync is a selective mutual-distillation framework for video object-centric learning (VOCL). Slot-based VOCL methods are guided by two spatial maps with complementary strengths: the encoder's attention map (sharp boundaries, noisy interiors) and the decoder's object map (coherent interiors, blurry boundaries). Rather than forcing dense agreement across all spatio-temporal patches, SSync distills only the most reliable cues from each map into the other:
- Encoder → Decoder: boundary refinement via crisp attention boundaries
- Decoder → Encoder: interior denoising via coherent object maps
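
The two distillation directions above can be pictured with a small sketch. It is not the released implementation: the hard argmax labels, the 4-neighbour boundary test, the choice to locate boundaries on the encoder map, and the names `hard_labels`, `boundary_mask`, and `selective_targets` are all assumptions made for illustration. Plain Python lists stand in for tensors.

```python
# Illustrative sketch only. The boundary test and the selection rule below are
# assumptions, not the method as published.

def hard_labels(soft_map):
    """Argmax over slots for every patch; soft_map[y][x] is a list of K slot scores."""
    return [[max(range(len(p)), key=p.__getitem__) for p in row] for row in soft_map]

def boundary_mask(labels):
    """Mark a patch as boundary if any 4-neighbour carries a different label."""
    h, w = len(labels), len(labels[0])
    mask = [[False] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                ny, nx = y + dy, x + dx
                if 0 <= ny < h and 0 <= nx < w and labels[ny][nx] != labels[y][x]:
                    mask[y][x] = True
                    break
    return mask

def selective_targets(enc_soft, dec_soft):
    """Return (decoder_targets, encoder_targets); None means 'no supervision here'."""
    enc_lab = hard_labels(enc_soft)
    dec_lab = hard_labels(dec_soft)
    # Assumption: boundaries are located on the encoder map, which is the sharper one.
    on_boundary = boundary_mask(enc_lab)
    h, w = len(enc_lab), len(enc_lab[0])
    # Encoder -> Decoder: encoder labels supervise the decoder at boundary patches.
    dec_target = [[enc_lab[y][x] if on_boundary[y][x] else None for x in range(w)]
                  for y in range(h)]
    # Decoder -> Encoder: decoder labels supervise the encoder at interior patches.
    enc_target = [[None if on_boundary[y][x] else dec_lab[y][x] for x in range(w)]
                  for y in range(h)]
    return dec_target, enc_target
```

Each patch is only compared with its four neighbours, so the work grows linearly with the number of patches in a frame.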
Both directions are realized through a pseudo-labeling scheme with linear complexity, which avoids quadratic pairwise spatial comparisons. A transitive pseudo-label merging step then consolidates redundant slots based on spatio-temporal activation consistency, making SSync robust to the configured number of slots.
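
One way to picture transitive merging is a union-find pass over slots. The sketch below is a hypothetical illustration, not the paper's procedure: it assumes each slot is summarized by the set of `(frame, patch)` indices where it is active, that two slots count as redundant when the IoU of those sets reaches an assumed `threshold`, and that merged groups are named after their smallest slot id. The function name `merge_slots` is likewise made up for this example.

```python
from itertools import combinations

def merge_slots(activations, threshold=0.5):
    """Group slots whose activation sets overlap, following links transitively.

    activations: dict mapping slot id -> set of (frame, patch) indices.
    Returns a dict mapping every slot id to the representative id of its group.
    The IoU criterion and the threshold are assumptions for illustration.
    """
    parent = {s: s for s in activations}

    def find(s):
        # Walk to the root, shortening the path along the way.
        while parent[s] != s:
            parent[s] = parent[parent[s]]
            s = parent[s]
        return s

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            # Keep the smaller id as the representative.
            parent[max(ra, rb)] = min(ra, rb)

    for a, b in combinations(sorted(activations), 2):
        inter = len(activations[a] & activations[b])
        union_size = len(activations[a] | activations[b])
        if union_size and inter / union_size >= threshold:
            union(a, b)

    return {s: find(s) for s in activations}
```

For example, if slots 0 and 1 overlap strongly and slots 1 and 2 overlap strongly, all three end up in one group even when slots 0 and 2 barely overlap. The pairwise loop runs over slots, not patches, so its cost depends on the slot count rather than on spatial resolution.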
Evaluation Results
Object discovery on VOCL benchmarks (averaged over 3 runs):
| Dataset | FG-ARI ↑ | mBO ↑ |
|---|---|---|
| MOVi-C (336×336) | 79.4 | 39.5 |
| MOVi-E (336×336) | 84.0 | 34.8 |
| YouTube-VIS 2021 (518×518) | 42.6 | 38.7 |
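
For readers unfamiliar with the two metrics, the following sketch shows how FG-ARI (adjusted Rand index restricted to foreground pixels) and mBO (mean best overlap) are commonly computed. It is an illustrative reference in plain Python, not the evaluation code behind the table: it assumes flattened integer label sequences over all pixels of a video, a background id of `0` in the ground truth, and scores in the range 0 to 1 rather than a percentage scale. The names `fg_ari` and `mbo` are chosen for this example.

```python
from collections import Counter
from math import comb

def fg_ari(gt, pred, background=0):
    """Adjusted Rand index over pixels whose ground-truth label is not background."""
    pairs = [(g, p) for g, p in zip(gt, pred) if g != background]
    total = comb(len(pairs), 2)
    if total == 0:
        return 1.0  # fewer than two foreground pixels: nothing to disagree on
    index = sum(comb(c, 2) for c in Counter(pairs).values())
    sum_gt = sum(comb(c, 2) for c in Counter(g for g, _ in pairs).values())
    sum_pred = sum(comb(c, 2) for c in Counter(p for _, p in pairs).values())
    expected = sum_gt * sum_pred / total
    max_index = 0.5 * (sum_gt + sum_pred)
    if max_index == expected:
        return 1.0  # degenerate case, e.g. a single cluster on both sides
    return (index - expected) / (max_index - expected)

def mbo(gt, pred, background=0):
    """Mean over ground-truth objects of the best IoU with any predicted mask."""
    pred_masks, gt_masks = {}, {}
    for i, (g, p) in enumerate(zip(gt, pred)):
        pred_masks.setdefault(p, set()).add(i)
        if g != background:
            gt_masks.setdefault(g, set()).add(i)
    if not gt_masks:
        return 0.0
    best = [max(len(m & q) / len(m | q) for q in pred_masks.values())
            for m in gt_masks.values()]
    return sum(best) / len(best)
```

FG-ARI ignores how background pixels are grouped, while mBO also penalizes predicted masks that spill outside an object, which is why the two columns can rank methods differently.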
Training Data
| Dataset | Size |
|---|---|
| YouTube-VIS 2021 | 26.43 GB |
| MOVi-C | 7.43 GB |
| MOVi-E | 8.26 GB |
See data/README.md for download instructions.
Citation
@inproceedings{moon2026ssync,
title = {Selective Synergistic Learning for Video Object-Centric Learning},
author = {Moon, WonJun and Heo, Jae-Pil},
booktitle = {European Conference on Computer Vision (ECCV)},
year = {2026}
}
Acknowledgements
Built upon VideoSAUR, SlotContrast, SRL, and SlotCurri.