Submitted by Haiwen Diao: From Pixels to Words -- Towards Native Vision-Language Primitives at Scale (SenseTime)