Papers
The Chosen One: Consistent Characters in Text-to-Image Diffusion Models • arXiv:2311.10093 • 58 upvotes
NeuroPrompts: An Adaptive Framework to Optimize Prompts for Text-to-Image Generation • arXiv:2311.12229 • 26 upvotes
Diffusion Model Alignment Using Direct Preference Optimization • arXiv:2311.12908 • 49 upvotes
VMC: Video Motion Customization using Temporal Attention Adaption for Text-to-Video Diffusion Models • arXiv:2312.00845 • 39 upvotes
AnimateZero: Video Diffusion Models are Zero-Shot Image Animators • arXiv:2312.03793 • 18 upvotes
One-dimensional Adapter to Rule Them All: Concepts, Diffusion Models and Erasing Applications • arXiv:2312.16145 • 10 upvotes
TinyGPT-V: Efficient Multimodal Large Language Model via Small Backbones • arXiv:2312.16862 • 31 upvotes
PhotoMaker: Customizing Realistic Human Photos via Stacked ID Embedding • arXiv:2312.04461 • 62 upvotes
Diffuse to Choose: Enriching Image Conditioned Inpainting in Latent Diffusion Models for Virtual Try-All • arXiv:2401.13795 • 68 upvotes
Scaling Up to Excellence: Practicing Model Scaling for Photo-Realistic Image Restoration In the Wild • arXiv:2401.13627 • 78 upvotes
MambaByte: Token-free Selective State Space Model • arXiv:2401.13660 • 60 upvotes
UNIMO-G: Unified Image Generation through Multimodal Conditional Diffusion • arXiv:2401.13388 • 13 upvotes
Lumiere: A Space-Time Diffusion Model for Video Generation • arXiv:2401.12945 • 87 upvotes
Multilingual and Fully Non-Autoregressive ASR with Large Language Model Fusion: A Comprehensive Study • arXiv:2401.12789 • 9 upvotes
Mastering Text-to-Image Diffusion: Recaptioning, Planning, and Generating with Multimodal LLMs • arXiv:2401.11708 • 30 upvotes
Scalable High-Resolution Pixel-Space Image Synthesis with Hourglass Diffusion Transformers • arXiv:2401.11605 • 23 upvotes
Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data • arXiv:2401.10891 • 62 upvotes
DiffusionGPT: LLM-Driven Text-to-Image Generation System • arXiv:2401.10061 • 32 upvotes
Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model • arXiv:2401.09417 • 62 upvotes
DeepSpeed-FastGen: High-throughput Text Generation for LLMs via MII and DeepSpeed-Inference • arXiv:2401.08671 • 15 upvotes
UFO: A UI-Focused Agent for Windows OS Interaction • arXiv:2402.07939 • 17 upvotes
λ-ECLIPSE: Multi-Concept Personalized Text-to-Image Diffusion Models by Leveraging CLIP Latent Space • arXiv:2402.05195 • 19 upvotes
FiT: Flexible Vision Transformer for Diffusion Model • arXiv:2402.12376 • 48 upvotes
PixArt-Σ: Weak-to-Strong Training of Diffusion Transformer for 4K Text-to-Image Generation • arXiv:2403.04692 • 40 upvotes
Yi: Open Foundation Models by 01.AI • arXiv:2403.04652 • 65 upvotes
StableDrag: Stable Dragging for Point-based Image Editing • arXiv:2403.04437 • 27 upvotes
CogView3: Finer and Faster Text-to-Image Generation via Relay Diffusion • arXiv:2403.05121 • 23 upvotes
Fast High-Resolution Image Synthesis with Latent Adversarial Diffusion Distillation • arXiv:2403.12015 • 70 upvotes
Be Yourself: Bounded Attention for Multi-Subject Text-to-Image Generation • arXiv:2403.16990 • 25 upvotes
ViTAR: Vision Transformer with Any Resolution • arXiv:2403.18361 • 55 upvotes
LLM Agent Operating System • arXiv:2403.16971 • 73 upvotes
Getting it Right: Improving Spatial Consistency in Text-to-Image Models • arXiv:2404.01197 • 31 upvotes
CosmicMan: A Text-to-Image Foundation Model for Humans • arXiv:2404.01294 • 17 upvotes
On the Scalability of Diffusion-based Text-to-Image Generation • arXiv:2404.02883 • 19 upvotes
Applying Guidance in a Limited Interval Improves Sample and Distribution Quality in Diffusion Models • arXiv:2404.07724 • 14 upvotes
Ferret-v2: An Improved Baseline for Referring and Grounding with Large Language Models • arXiv:2404.07973 • 32 upvotes
EdgeFusion: On-Device Text-to-Image Generation • arXiv:2404.11925 • 23 upvotes
InstantFamily: Masked Attention for Zero-shot Multi-ID Image Generation • arXiv:2404.19427 • 74 upvotes
DressCode: Autoregressively Sewing and Generating Garments from Text Guidance • arXiv:2401.16465 • 12 upvotes
AV-DiT: Efficient Audio-Visual Diffusion Transformer for Joint Audio and Video Generation • arXiv:2406.07686 • 17 upvotes
Wavelets Are All You Need for Autoregressive Image Generation • arXiv:2406.19997 • 31 upvotes
InstantStyle-Plus: Style Transfer with Content-Preserving in Text-to-Image Generation • arXiv:2407.00788 • 23 upvotes
InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output • arXiv:2407.03320 • 94 upvotes
Diffusion Forcing: Next-token Prediction Meets Full-Sequence Diffusion • arXiv:2407.01392 • 44 upvotes
OmniParser for Pure Vision Based GUI Agent • arXiv:2408.00203 • 24 upvotes
Lumina-mGPT: Illuminate Flexible Photorealistic Text-to-Image Generation with Multimodal Generative Pretraining • arXiv:2408.02657 • 35 upvotes
Eliminating Oversaturation and Artifacts of High Guidance Scales in Diffusion Models • arXiv:2410.02416 • 34 upvotes
DART: Denoising Autoregressive Transformer for Scalable Text-to-Image Generation • arXiv:2410.08159 • 26 upvotes
Fluid: Scaling Autoregressive Text-to-image Generative Models with Continuous Tokens • arXiv:2410.13863 • 37 upvotes
HiFi-SR: A Unified Generative Transformer-Convolutional Adversarial Network for High-Fidelity Speech Super-Resolution • arXiv:2501.10045 • 10 upvotes
VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding • arXiv:2501.13106 • 90 upvotes
ARWKV: Pretrain is not what we need, an RNN-Attention-Based Language Model Born from Transformer • arXiv:2501.15570 • 25 upvotes
SANA 1.5: Efficient Scaling of Training-Time and Inference-Time Compute in Linear Diffusion Transformer • arXiv:2501.18427 • 24 upvotes