Distilled from VJEPA feature space?

#1
by eggsbenedicto - opened

It seems like the ability to lean on V-JEPA (pretrained on what, a billion videos?) probably jumpstarted the training a lot. Nice work, I'll dive into the architecture more. From the visual results it seems comparable to Wan2.1 1.3B.

Motif Technologies org

Thanks for checking it out, really appreciate it!

Yeah, V-JEPA gives a pretty strong prior on video features, so "distilled from V-JEPA" isn't a bad way to describe it. We use V-JEPA for REPA-style representation alignment, but only in the early training phase (noted in our tech report).
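
For anyone less familiar with REPA, here's a rough sketch of what that kind of auxiliary objective looks like: intermediate features from the diffusion transformer get projected into the teacher's feature space and pulled toward frozen V-JEPA features. This is just an illustration, not our actual code; `proj_head`, `dit_hidden`, and `vjepa_feats` are placeholder names.

```python
import torch
import torch.nn.functional as F

def repa_alignment_loss(dit_hidden, vjepa_feats, proj_head):
    """Illustrative REPA-style alignment loss (not the actual implementation).

    dit_hidden:  (B, N, D_dit)   hidden states from an intermediate DiT block
    vjepa_feats: (B, N, D_teach) frozen teacher (V-JEPA) features for the same tokens
    proj_head:   small trainable MLP mapping D_dit -> D_teach
    """
    pred = F.normalize(proj_head(dit_hidden), dim=-1)        # project into teacher space
    target = F.normalize(vjepa_feats.detach(), dim=-1)       # teacher stays frozen
    # Negative cosine similarity, averaged over tokens and batch
    return -(pred * target).sum(dim=-1).mean()
```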

A couple of implementation details on why early-only:

(1) We disable REPA later in training, inspired by this paper. Interesting contrast: models like Waver go the other direction, applying REPA only from the intermediate stages (480p), since REPA compute is relatively heavy during low-res / image pretraining. We went with "REPA early, then off" based on that paper plus reason (2); see the sketch after this list.

(2) Following this work, we operated under the assumption that dense features matter a lot for REPA to really pay off. V-JEPA (as the V-JEPA 2.1 paper notes, and as we show with examples in our tech report) isn't particularly dense-feature-rich. We didn't see the order-of-magnitude speedup the original REPA paper reported, and this is probably why. Next time around, I think the right move is a teacher that's both dense-feature-rich and has temporal compression (which V-JEPA 2.1 already does).
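
As a toy illustration of the "REPA early, then off" schedule from point (1): the alignment term is just an auxiliary loss whose weight drops to zero past some training step. The weight and threshold below are made-up numbers, not the ones we actually used.

```python
REPA_WEIGHT = 0.5        # hypothetical auxiliary-loss weight
REPA_OFF_STEP = 100_000  # hypothetical step after which REPA is switched off

def total_loss(diffusion_loss, repa_loss, step):
    # Apply the alignment loss only during the early phase of training.
    weight = REPA_WEIGHT if step < REPA_OFF_STEP else 0.0
    return diffusion_loss + weight * repa_loss
```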

More details on the rest of the architecture are in the tech report if you want to dig in.

Thanks again!
