Vision
Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs
Paper
• 2406.16860
• Published
• 63
Understanding Alignment in Multimodal LLMs: A Comprehensive Study
Paper
• 2407.02477
• Published
• 24
LongVILA: Scaling Long-Context Visual Language Models for Long Videos
Paper
• 2408.10188
• Published
• 52
Building and better understanding vision-language models: insights and future directions
Paper
• 2408.12637
• Published
• 133
MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans?
Paper
• 2408.13257
• Published
• 26
CogVLM2: Visual Language Models for Image and Video Understanding
Paper
• 2408.16500
• Published
• 57
VisionTS: Visual Masked Autoencoders Are Free-Lunch Zero-Shot Time Series Forecasters
Paper
• 2408.17253
• Published
• 39
General OCR Theory: Towards OCR-2.0 via a Unified End-to-end Model
Paper
• 2409.01704
• Published
• 83
Guide-and-Rescale: Self-Guidance Mechanism for Effective Tuning-Free Real Image Editing
Paper
• 2409.01322
• Published
• 96
NVLM: Open Frontier-Class Multimodal LLMs
Paper
• 2409.11402
• Published
• 74
Phidias: A Generative Model for Creating 3D Content from Text, Image, and 3D Conditions with Reference-Augmented Diffusion
Paper
• 2409.11406
• Published
• 27
MMCOMPOSITION: Revisiting the Compositionality of Pre-trained Vision-Language Models
Paper
• 2410.09733
• Published
• 8
LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via Hybrid Architecture
Paper
• 2409.02889
• Published
• 54
Euclid: Supercharging Multimodal LLMs with Synthetic High-Fidelity Visual Descriptions
Paper
• 2412.08737
• Published
• 54
Progressive Multimodal Reasoning via Active Retrieval
Paper
• 2412.14835
• Published
• 73
AutoCLIP: Auto-tuning Zero-Shot Classifiers for Vision-Language Models
Paper
• 2309.16414
• Published
• 19
InfiniteYou: Flexible Photo Recrafting While Preserving Your Identity
Paper
• 2503.16418
• Published
• 36
ReSearch: Learning to Reason with Search for LLMs via Reinforcement Learning
Paper
• 2503.19470
• Published
• 19
A Preliminary Study of o1 in Medicine: Are We Closer to an AI Doctor?
Paper
• 2409.15277
• Published
• 38
Unified Vision-Language-Action Model
Paper
• 2506.19850
• Published
• 27