Multiple Subject Reference IC-LoRA (Test Version)

⚠️ This is a test version, released to collect feedback that will guide future optimization.

Overview

This model implements a novel approach to multi-reference video generation using Multiple Subject Reference (MSR). Instead of introducing additional encoder branches or fusion modules, we transform multiple static reference images into a pseudo-video sequence that shares the same representation space as the target video.
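The core idea, treating a handful of still images as frames of a short "video", can be illustrated with a minimal sketch. This is not the plugin's implementation: the function name `build_pseudo_video`, the use of NumPy arrays, and the requirement that all images share one size are assumptions made for illustration. In the real pipeline the stacked frames would then be encoded into video latents, a step not shown here.

```python
import numpy as np

def build_pseudo_video(reference_images):
    """Stack same-sized RGB images (H, W, 3) into one array of shape (N, H, W, 3).

    Hypothetical helper: each reference image becomes one "frame" of a
    pseudo-video, so it can be handled like ordinary video input.
    """
    if not 2 <= len(reference_images) <= 5:
        raise ValueError("expected 2 to 5 reference images")
    shapes = {img.shape for img in reference_images}
    if len(shapes) != 1:
        raise ValueError("all reference images must share one shape; resize first")
    return np.stack(reference_images, axis=0)

# Three dummy 4x6 RGB "images", each filled with a distinct value.
refs = [np.full((4, 6, 3), i, dtype=np.uint8) for i in range(3)]
pseudo_video = build_pseudo_video(refs)
print(pseudo_video.shape)  # (3, 4, 6, 3)
```

Stacking along a new leading axis keeps every reference intact as its own frame, which is what lets later stages address each reference separately instead of working from a single blended image.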

Usage

This LoRA requires the ComfyUI-Licon-MSR plugin for ComfyUI. A sample workflow is included in the model files for easy testing and experimentation.

Key Features

Multi-Reference Visual Memory

  • Token-level reference preservation: Multiple reference images are encoded as video latents, preserving fine-grained visual information at the token level rather than compressing it into a single embedding
  • Native self-attention retrieval: The target video tokens directly access reference tokens through the model's existing self-attention mechanism, so no new architectural components are needed
  • In-context conditioning: References serve as "visual memory" within the main token sequence, not as external conditioning inputs
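The in-context mechanism described in the list above can be sketched as a toy example: reference tokens and target tokens are concatenated into one sequence, and ordinary self-attention lets every target token attend to every reference token. Everything here is an assumption for illustration only (single head, identity projections, random tokens, the names `ref_tokens` and `target_tokens`); the actual model uses its own learned attention layers.

```python
import numpy as np

def self_attention(tokens):
    """Single-head scaled dot-product self-attention with identity projections."""
    d = tokens.shape[-1]
    scores = tokens @ tokens.T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ tokens, weights

rng = np.random.default_rng(0)
ref_tokens = rng.normal(size=(6, 8))      # stand-in for reference latent tokens
target_tokens = rng.normal(size=(10, 8))  # stand-in for target video tokens

# References live inside the main sequence rather than in a separate branch.
sequence = np.concatenate([ref_tokens, target_tokens], axis=0)
out, weights = self_attention(sequence)

# Share of attention each target token spends on the reference tokens.
ref_mass = weights[len(ref_tokens):, :len(ref_tokens)].sum(axis=-1)
print(out.shape, ref_mass.shape)  # (16, 8) (10,)
```

Because the references are just additional rows in the sequence, no cross-attention module or extra encoder is involved; the attention weights alone decide how much each target token draws from each reference.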

Flexible Reference Composition

  • 2 to 5 reference images: Accepts anywhere from 2 to 5 reference inputs, with composition complexity increasing as references are added
  • Complementary semantic roles: Each reference image can carry different information:
    • Subject identity
    • Object/prop details
    • Scene/background
    • Local textures
    • Multiple viewpoints
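One way to keep track of which image plays which role before loading them into a workflow is a small bookkeeping structure like the one below. This is a hypothetical sketch, not part of the ComfyUI-Licon-MSR plugin: the `Role` enum, the `Reference` class, `validate_references`, and the file names are all invented for illustration. Only the role names and the 2-to-5 limit come from the list above.

```python
from dataclasses import dataclass
from enum import Enum

class Role(Enum):
    SUBJECT = "subject identity"
    OBJECT = "object/prop details"
    SCENE = "scene/background"
    TEXTURE = "local textures"
    VIEWPOINT = "multiple viewpoints"

@dataclass(frozen=True)
class Reference:
    path: str   # hypothetical image path
    role: Role  # what this image is meant to contribute

def validate_references(refs):
    """Check the supported count (2 to 5) and return the references as a list."""
    refs = list(refs)
    if not 2 <= len(refs) <= 5:
        raise ValueError(f"expected 2 to 5 references, got {len(refs)}")
    return refs

# Example file names are placeholders.
refs = validate_references([
    Reference("person.png", Role.SUBJECT),
    Reference("mug.png", Role.OBJECT),
    Reference("kitchen.png", Role.SCENE),
])
print([r.role.value for r in refs])
# ['subject identity', 'object/prop details', 'scene/background']
```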

What It Can Do

Identity Preservation Across References

Generate videos where multiple reference identities are simultaneously preserved:

  • Multiple characters from different reference images
  • Character + object combinations
  • Object + scene compositions

Relation-Based Composition

Beyond mere identity preservation, the model can compose references based on textual relation descriptions:

  • Action interactions (handing, picking up, pushing)
  • Spatial relationships (left-right, foreground-background)
  • Temporal event structures (start → process → result)

Cross-Reference Attribute Selection

The model learns to selectively retrieve attributes from different references:

  • Face from reference A, clothing from reference B
  • Object identity from one reference, pose/position from another
  • Background elements from scene references

Current Issues (Test Version)

  • High-motion limb distortion: Significant degradation in limb quality during fast or complex motion sequences
  • Slight object consistency loss: Minor identity drift for objects over the course of the video

Results Showcase

2-Reference Comparison

[Video comparison: Reference Images | Our Model | Seedance2.0]

4-Reference Comparison

[Video comparison: Reference Images | Our Model | Seedance2.0]