# LTX-2.3-ID-LoRA-CelebVHQ-3K
This repository contains the ID-LoRA checkpoint trained on the CelebV-HQ dataset, as introduced in the paper ID-LoRA: Identity-Driven Audio-Video Personalization with In-Context LoRA.
Project Page | GitHub | Paper
**LTX-2.3 variant:** This checkpoint is trained on the newer LTX-2.3 (22B) base model with 3,000 training steps. For the original LTX-2 (19B) version, see ID-LoRA-CelebVHQ.
ID-LoRA (Identity-Driven In-Context LoRA) jointly generates a subject's appearance and voice in a single model, letting a text prompt, a reference image, and a short audio clip govern both modalities together. Built on top of LTX-2.3, it is the first method to personalize visual appearance and voice within a single generative pass.
## Details
| Property | Value |
|---|---|
| Base model | LTX-2.3 22B |
| Training dataset | CelebV-HQ |
| LoRA rank | 128 |
| Training steps | 3,000 |
| Strategy | `audio_ref_only_ic` with negative temporal positions |
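As a rough illustration of what the rank-128 entry above means (this is generic LoRA arithmetic, not the repository's code, and the layer sizes are invented for the example): a LoRA adapter replaces a full weight update with a low-rank product of two small matrices, which is why the adapter file stays near 1 GB even though the base model has 22B parameters.

```python
import numpy as np

# Illustrative shapes only; the actual LTX-2.3 layer sizes are not stated here.
d_out, d_in, rank = 4096, 4096, 128
alpha = 128  # a common convention: effective scale = alpha / rank

# Full fine-tuning would update a d_out x d_in matrix...
full_params = d_out * d_in
# ...while LoRA learns two small factors A (rank x d_in) and B (d_out x rank).
lora_params = rank * d_in + d_out * rank

A = np.random.randn(rank, d_in) * 0.01
B = np.zeros((d_out, rank))  # B starts at zero, so the adapter is a no-op initially

delta_W = (alpha / rank) * B @ A  # the effective weight update applied to the base layer
print(f"full: {full_params:,} params, lora: {lora_params:,} params "
      f"({full_params / lora_params:.1f}x fewer)")
```

For these square 4096-wide layers, rank 128 stores 16x fewer parameters per adapted layer than a full update would.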
## Usage
This checkpoint requires the ID-LoRA-2.3 packages. Clone the official repository, switch the workspace to the LTX-2.3 packages, and download models:
```bash
git clone https://github.com/ID-LoRA/ID-LoRA.git && cd ID-LoRA
# Point workspace at LTX-2.3 packages (edit pyproject.toml):
# [tool.uv.workspace]
# members = ["ID-LoRA-2.3/packages/*"]
uv sync --frozen
bash ID-LoRA-2.3/scripts/download_models.sh
```
### Two-Stage Inference (Recommended)
Generates at 512x512, then upscales to 1024x1024 with a distilled-LoRA refinement pass.
```bash
python ID-LoRA-2.3/scripts/inference_two_stage.py \
  --lora-path models/id-lora-celebvhq/lora_weights.safetensors \
  --reference-audio examples/reference.wav \
  --first-frame examples/first_frame.png \
  --prompt "[VISUAL]: A close-up of a person speaking. [SPEECH]: Hello world. [SOUNDS]: Clear speech." \
  --output-dir outputs/two_stage \
  --quantize
```
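The inference scripts expect a single prompt string with `[VISUAL]`, `[SPEECH]`, and `[SOUNDS]` sections, as in the example above. A small helper (hypothetical, not part of the repository) can assemble it from its parts:

```python
def build_prompt(visual: str, speech: str, sounds: str) -> str:
    """Compose the bracketed prompt format used by the inference scripts."""
    return f"[VISUAL]: {visual} [SPEECH]: {speech} [SOUNDS]: {sounds}"

prompt = build_prompt(
    "A close-up of a person speaking.",
    "Hello world.",
    "Clear speech.",
)
print(prompt)
# [VISUAL]: A close-up of a person speaking. [SPEECH]: Hello world. [SOUNDS]: Clear speech.
```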
### Two-Stage HQ (New in v2.3)
Higher-quality variant using the Res2s sampler and rescaling guidance. Uses fewer steps (15 vs. 30) while producing higher-fidelity results.
```bash
python ID-LoRA-2.3/scripts/inference_two_stage_hq.py \
  --lora-path models/id-lora-celebvhq/lora_weights.safetensors \
  --reference-audio examples/reference.wav \
  --first-frame examples/first_frame.png \
  --prompt "[VISUAL]: A close-up of a person speaking. [SPEECH]: Hello world. [SOUNDS]: Clear speech." \
  --output-dir outputs/two_stage_hq \
  --quantize
```
### One-Stage (Faster, Lower VRAM)
Generates at a single resolution without upscaling.
```bash
python ID-LoRA-2.3/scripts/inference_one_stage.py \
  --lora-path models/id-lora-celebvhq/lora_weights.safetensors \
  --reference-audio examples/reference.wav \
  --first-frame examples/first_frame.png \
  --prompt "[VISUAL]: A close-up of a person speaking. [SPEECH]: Hello world. [SOUNDS]: Clear speech." \
  --output-dir outputs/one_stage \
  --quantize
```
## Files
- `lora_weights.safetensors` -- LoRA adapter weights (~1.1 GB)
- `training_config.yaml` -- Training configuration used to produce this checkpoint
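To sanity-check `lora_weights.safetensors` without loading it into a model, you can read its header using only the standard library: a safetensors file begins with an 8-byte little-endian length followed by a JSON table of tensor names, dtypes, and shapes. The sketch below builds a tiny demo file so the reader function can be shown end to end; the tensor name and shape in it are invented for illustration.

```python
import json
import struct

def read_safetensors_header(path: str) -> dict:
    """Return the JSON header of a .safetensors file (names, dtypes, shapes)."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))  # 8-byte LE header size
        return json.loads(f.read(header_len))

# Build a minimal in-memory example so the function can be demonstrated
# without the real checkpoint; the key name "lora_A.weight" is hypothetical.
header = {"lora_A.weight": {"dtype": "F32", "shape": [128, 4096],
                            "data_offsets": [0, 128 * 4096 * 4]}}
blob = json.dumps(header).encode()
with open("demo.safetensors", "wb") as f:
    f.write(struct.pack("<Q", len(blob)) + blob)

info = read_safetensors_header("demo.safetensors")
print(info["lora_A.weight"]["shape"])  # [128, 4096]
```

Run against the real checkpoint, the same function lists every adapter tensor and its shape, which is a quick way to confirm the rank-128 factors listed in the Details table.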
## Citation
```bibtex
@misc{dahan2026idloraidentitydrivenaudiovideopersonalization,
  title         = {ID-LoRA: Identity-Driven Audio-Video Personalization with In-Context LoRA},
  author        = {Aviad Dahan and Moran Yanuka and Noa Kraicer and Lior Wolf and Raja Giryes},
  year          = {2026},
  eprint        = {2603.10256},
  archivePrefix = {arXiv},
  primaryClass  = {cs.SD},
  url           = {https://arxiv.org/abs/2603.10256}
}
```