Multiple Subject Reference IC-LoRA (Test Version)
⚠️ This is a test version, released to collect feedback that will guide future optimization.
Overview
This model implements a novel approach to multi-reference video generation using Multiple Subject Reference (MSR). Instead of introducing additional encoder branches or fusion modules, we transform multiple static reference images into a pseudo-video sequence that shares the same representation space as the target video.
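The idea of turning static references into a pseudo-video can be sketched in a few lines. This is a minimal illustration, not the plugin's actual code: the function names `build_pseudo_video` and `video_shape`, the one-frame-per-reference layout, and the toy single-channel images are all assumptions made for the example.

```python
from typing import List, Sequence, Tuple

Frame = List[List[float]]  # toy H x W single-channel image (assumption)


def build_pseudo_video(references: Sequence[Frame]) -> List[Frame]:
    """Stack static reference images along a new time axis, one frame each."""
    if not references:
        raise ValueError("at least one reference image is required")
    height, width = len(references[0]), len(references[0][0])
    for index, ref in enumerate(references):
        # Frames of one video must share a resolution, so check before stacking.
        if len(ref) != height or any(len(row) != width for row in ref):
            raise ValueError(f"reference {index} does not match {height}x{width}")
    # Copy the rows so the pseudo-video does not alias the input images.
    return [[list(row) for row in ref] for ref in references]


def video_shape(video: List[Frame]) -> Tuple[int, int, int]:
    """Return (frames, height, width) of a stacked sequence."""
    return (len(video), len(video[0]), len(video[0][0]))


face = [[0.1, 0.2], [0.3, 0.4]]   # stand-in for a subject reference
prop = [[0.9, 0.8], [0.7, 0.6]]   # stand-in for an object reference
pseudo_video = build_pseudo_video([face, prop])
print(video_shape(pseudo_video))  # axes are (frames, height, width)
```

Because the result has the same layout as an ordinary clip, it could in principle be passed through the same video encoder as the target, which is what lets both share one representation space.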
Usage
This LoRA requires the ComfyUI-Licon-MSR plugin for ComfyUI. A sample workflow is included in the model files for easy testing and experimentation.
Key Features
Multi-Reference Visual Memory
- Token-level reference preservation: Multiple reference images are encoded as video latents, preserving fine-grained visual information at the token level rather than compressing them into a single embedding
- Native self-attention retrieval: The target video tokens directly access reference tokens through the model's existing self-attention mechanism, so no new architectural components are needed
- In-context conditioning: References serve as "visual memory" within the main token sequence, not as external conditioning inputs
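A toy example may help show why plain self-attention is enough for retrieval once reference tokens sit in the same sequence as target tokens. This sketch makes several assumptions that are not from the model itself: references are placed before the target tokens, tokens are 2-dimensional, there is a single head, and the learned query/key projections are omitted.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over one row of attention scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention_weights(tokens: List[Vector]) -> List[List[float]]:
    """Scaled dot-product attention weights with tokens as queries and keys."""
    scale = 1.0 / math.sqrt(len(tokens[0]))
    weights = []
    for query in tokens:
        scores = [scale * sum(q * k for q, k in zip(query, key)) for key in tokens]
        weights.append(softmax(scores))
    return weights


# Hypothetical latents: two reference tokens followed by two target tokens.
reference_tokens = [[4.0, 0.0], [0.0, 4.0]]
target_tokens = [[3.0, 0.0], [0.0, 3.0]]
sequence = reference_tokens + target_tokens  # one in-context token sequence

weights = self_attention_weights(sequence)
n_ref = len(reference_tokens)

# For each target token, find the reference token it attends to most.
retrieved = [
    max(range(n_ref), key=lambda j: weights[n_ref + i][j])
    for i in range(len(target_tokens))
]
print(retrieved)
```

Each target token ends up attending most strongly to the reference token it resembles, without any extra encoder branch or fusion module; the only change is what the sequence contains.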
Flexible Reference Composition
- 2 to 5 reference images: Accepts anywhere from 2 to 5 reference inputs, with composition complexity increasing as more references are added
- Complementary semantic roles: Each reference image can carry different information:
  - Subject identity
  - Object/prop details
  - Scene/background
  - Local textures
  - Multiple viewpoints
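When preparing inputs, it can be useful to check the reference set before building a workflow. The helper below is a hypothetical sketch: `Reference`, `validate_references`, and the role labels are illustrative names, not part of the ComfyUI-Licon-MSR plugin; only the 2 to 5 limit and the list of roles come from this model card.

```python
from dataclasses import dataclass
from typing import List

MIN_REFERENCES = 2  # lower bound stated in this model card
MAX_REFERENCES = 5  # upper bound stated in this model card

# Illustrative labels for the semantic roles listed above (assumed names).
ROLES = {"subject", "object", "scene", "texture", "viewpoint"}


@dataclass(frozen=True)
class Reference:
    path: str  # image file path (hypothetical)
    role: str  # one of ROLES


def validate_references(refs: List[Reference]) -> List[Reference]:
    """Check the reference count and role labels, returning refs unchanged."""
    if not MIN_REFERENCES <= len(refs) <= MAX_REFERENCES:
        raise ValueError(
            f"expected {MIN_REFERENCES} to {MAX_REFERENCES} references, got {len(refs)}"
        )
    for ref in refs:
        if ref.role not in ROLES:
            raise ValueError(f"unknown role {ref.role!r} for {ref.path}")
    return refs


refs = validate_references(
    [
        Reference("character.png", "subject"),
        Reference("umbrella.png", "object"),
        Reference("street.png", "scene"),
    ]
)
print([ref.role for ref in refs])
```

Tagging each image with a role is purely an organizational aid here; the model itself receives the images and the text prompt, not these labels.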
What It Can Do
Identity Preservation Across References
Generate videos where multiple reference identities are simultaneously preserved:
- Multiple characters from different reference images
- Character + object combinations
- Object + scene compositions
Relation-Based Composition
Beyond mere identity preservation, the model can compose references based on textual relation descriptions:
- Action interactions (handing, picking up, pushing)
- Spatial relationships (left-right, foreground-background)
- Temporal event structures (start → process → result)
Cross-Reference Attribute Selection
The model learns to selectively retrieve attributes from different references:
- Face from reference A, clothing from reference B
- Object identity from one reference, pose/position from another
- Background elements from scene references
Current Issues (Test Version)
- High-motion limb distortion: Significant degradation in limb quality during fast or complex motion sequences
- Slight object consistency loss: Minor identity drift for objects over the course of the video