Interest
CompCap: Improving Multimodal Large Language Models with Composite Captions
Paper
• 2412.05243
• Published
• 20
LiFT: Leveraging Human Feedback for Text-to-Video Model Alignment
Paper
• 2412.04814
• Published
• 46
MAmmoTH-VL: Eliciting Multimodal Reasoning with Instruction Tuning at Scale
Paper
• 2412.05237
• Published
• 46
Exploring Multi-Grained Concept Annotations for Multimodal Large Language Models
Paper
• 2412.05939
• Published
• 15
Chimera: Improving Generalist Model with Domain-Specific Experts
Paper
• 2412.05983
• Published
• 9
ILLUME: Illuminating Your LLMs to See, Draw, and Self-Enhance
Paper
• 2412.06673
• Published
• 11
Video Motion Transfer with Diffusion Transformers
Paper
• 2412.07776
• Published
• 17
Perception Tokens Enhance Visual Reasoning in Multimodal Language Models
Paper
• 2412.03548
• Published
• 17
Frame Representation Hypothesis: Multi-Token LLM Interpretability and Concept-Guided Text Generation
Paper
• 2412.07334
• Published
• 17
StreamChat: Chatting with Streaming Video
Paper
• 2412.08646
• Published
• 18
SAME: Learning Generic Language-Guided Visual Navigation with State-Adaptive Mixture of Experts
Paper
• 2412.05552
• Published
• 6
OLA-VLM: Elevating Visual Perception in Multimodal LLMs with Auxiliary Embedding Distillation
Paper
• 2412.09585
• Published
• 11
Multimodal Latent Language Modeling with Next-Token Diffusion
Paper
• 2412.08635
• Published
• 49
Euclid: Supercharging Multimodal LLMs with Synthetic High-Fidelity Visual Descriptions
Paper
• 2412.08737
• Published
• 54
InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions
Paper
• 2412.09596
• Published
• 97
VideoICL: Confidence-based Iterative In-context Learning for Out-of-Distribution Video Understanding
Paper
• 2412.02186
• Published
• 23
Fourier Position Embedding: Enhancing Attention's Periodic Extension for Length Generalization
Paper
• 2412.17739
• Published
• 41
Large Concept Models: Language Modeling in a Sentence Representation Space
Paper
• 2412.08821
• Published
• 17
The GAN is dead; long live the GAN! A Modern GAN Baseline
Paper
• 2501.05441
• Published
• 95
Tarsier2: Advancing Large Vision-Language Models from Detailed Video Description to Comprehensive Video Understanding
Paper
• 2501.07888
• Published
• 15
Temporal Preference Optimization for Long-Form Video Understanding
Paper
• 2501.13919
• Published
• 23
VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding
Paper
• 2501.13106
• Published
• 90
PaSa: An LLM Agent for Comprehensive Academic Paper Search
Paper
• 2501.10120
• Published
• 54
PokerBench: Training Large Language Models to become Professional Poker Players
Paper
• 2501.08328
• Published
• 19
MiniMax-01: Scaling Foundation Models with Lightning Attention
Paper
• 2501.08313
• Published
• 300
Inference-Time Scaling for Diffusion Models beyond Scaling Denoising Steps
Paper
• 2501.09732
• Published
• 72
Evolving Deeper LLM Thinking
Paper
• 2501.09891
• Published
• 115
VideoWorld: Exploring Knowledge Learning from Unlabeled Videos
Paper
• 2501.09781
• Published
• 27
Transformers without Normalization
Paper
• 2503.10622
• Published
• 170
Towards Understanding Camera Motions in Any Video
Paper
• 2504.15376
• Published
• 155