Distilled from VJEPA feature space?
It seems like the ability to lean on VJEPA (pretrained on what, a billion videos?) probably jumpstarted the training a lot. Nice work, will dive into the architecture more. From the visual results, it seems comparable to Wan2.1 1.3B.
Thanks for checking it out, really appreciate it!
Yeah, V-JEPA gives a pretty strong prior on video features, so "distilled from JEPA" isn't a bad way to describe it. We use V-JEPA for REPA-style representation alignment, but only in the early training phase (noted in our tech report).
A couple of implementation details on why early-only:
(1) We disable REPA later in training, inspired by this paper. Interesting contrast: models like Waver go the other direction and apply REPA only from intermediate stages (480p), since REPA compute is relatively heavy during low-res / image pretraining. We went "REPA early, then off" based on that paper plus reason (2).
(2) Following this work, we worked under the assumption that dense features matter a lot for REPA to really pay off. V-JEPA (as the V-JEPA 2.1 paper notes, and as we show with examples in our tech report) isn't particularly dense-feature-rich. We didn't see the order-of-magnitude speedup the original REPA paper reported, and this is probably why. Next time around, I think the right move is a teacher that's both dense-feature-rich and has temporal compression (which V-JEPA 2.1 already does).
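To make "REPA early, then off" concrete, here is a minimal sketch in plain Python, not the actual training code. It assumes the alignment term is a negative mean cosine similarity between projected student tokens and frozen teacher (V-JEPA) tokens, which is the usual REPA-style formulation. The names `repa_weight`, `cutoff_step`, and `base_weight`, and the value 0.5, are hypothetical placeholders; the real schedule and weight are in the tech report.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)  # epsilon guards against zero vectors


def repa_alignment_loss(student_tokens, teacher_tokens):
    """Negative mean per-token cosine similarity (lower means better aligned).

    student_tokens: projected hidden states from the generator, one vector per token.
    teacher_tokens: features from the frozen teacher for the same tokens.
    """
    sims = [cosine_similarity(s, t) for s, t in zip(student_tokens, teacher_tokens)]
    return -sum(sims) / len(sims)


def repa_weight(step, cutoff_step, base_weight=0.5):
    """Hypothetical early-only schedule: constant weight, then exactly zero."""
    return base_weight if step < cutoff_step else 0.0


def total_loss(diffusion_loss, student_tokens, teacher_tokens, step, cutoff_step):
    """Diffusion loss plus the alignment term while REPA is still switched on."""
    weight = repa_weight(step, cutoff_step)
    if weight == 0.0:
        # Past the cutoff the alignment term is dropped entirely; in a real
        # pipeline the teacher forward pass could be skipped here as well.
        return diffusion_loss
    return diffusion_loss + weight * repa_alignment_loss(student_tokens, teacher_tokens)
```

A hard cutoff is only one option; a decaying weight would fit the same interface by changing `repa_weight` alone.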
More details on the rest of the architecture are in the tech report if you want to dig in.
Thanks again!