Customized text-to-video generation aims to produce high-quality videos that incorporate user-specified subject identities or motion patterns. However, existing methods mainly focus on personalizing a single concept, either subject identity or motion pattern, which limits their effectiveness when multiple subjects must interact with the desired motion patterns. To tackle this challenge, we propose VideoMage, a unified framework for video customization over both multiple subjects and their interactive motions. VideoMage employs subject and motion LoRAs to capture personalized content from user-provided images and videos, along with an appearance-agnostic motion learning approach that disentangles motion patterns from visual appearance. Furthermore, we develop a spatial-temporal collaborative composition scheme to guide interactions among subjects within the desired motion patterns. Extensive experiments demonstrate that VideoMage outperforms existing methods, generating coherent, user-controlled videos with consistent subject identities and interactions.
Overview of VideoMage. (a) VideoMage leverages LoRAs to separately learn visual appearance and appearance-agnostic motion (see Fig. 1). (b) During inference, given a text prompt describing the visual and motion concepts, our proposed Spatial-Temporal Collaborative Composition (see Fig. 2) integrates these learned LoRAs to generate videos with the desired visual and motion characteristics.
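For readers unfamiliar with LoRA, the sketch below shows how a low-rank adapter augments a frozen linear projection; the layer choice, rank, and initialization are illustrative assumptions rather than the paper's exact configuration. In video diffusion backbones, subject LoRAs are commonly attached to spatial attention projections and the motion LoRA to temporal attention projections.

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wrap a frozen linear layer with a trainable low-rank update: W x + (alpha / r) * B(A(x))."""

    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 1.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # pretrained weights stay frozen; only the adapter trains
        self.down = nn.Linear(base.in_features, rank, bias=False)   # A: project to low rank
        self.up = nn.Linear(rank, base.out_features, bias=False)    # B: project back up
        nn.init.normal_(self.down.weight, std=1e-4)
        nn.init.zeros_(self.up.weight)  # zero init so the adapted model starts identical to the base
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * self.up(self.down(x))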
Figure 1. Appearance-Agnostic Motion Learning. To disentangle motion patterns from visual appearance, we first learn each subject's static appearance via Textual Inversion. Then, we apply negative classifier-free guidance to eliminate appearance information during motion learning, ensuring the motion LoRA captures only motion dynamics.
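The snippet below is a minimal sketch of how such negative classifier-free guidance might be composed during motion LoRA training: the noise prediction is steered away from the subject's learned appearance token (obtained via Textual Inversion) relative to an unconditional prediction. The function name, the diffusers-style UNet interface, and the guidance weight w_neg are assumptions; the paper's exact loss formulation may differ.

import torch

def appearance_agnostic_pred(unet, z_t, t, c_motion, c_appearance, c_null, w_neg=1.0):
    """Noise prediction steered away from the learned appearance embedding (hypothetical sketch)."""
    eps_motion = unet(z_t, t, encoder_hidden_states=c_motion).sample   # motion-prompt branch
    eps_app = unet(z_t, t, encoder_hidden_states=c_appearance).sample  # appearance-token branch
    eps_null = unet(z_t, t, encoder_hidden_states=c_null).sample       # unconditional branch
    # Negative classifier-free guidance: subtract the appearance direction so the
    # remaining signal used to train the motion LoRA reflects motion dynamics only.
    return eps_motion - w_neg * (eps_app - eps_null)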
Figure 2. Spatial-Temporal Collaborative Composition. During inference, we first (a) fuse multiple subject LoRAs into a single LoRA. With the fused subject LoRA and the motion LoRA, we propose (b) a Spatial-Temporal Collaborative Sampling (SCS) scheme that independently samples noise from the subject and motion branches and combines their predictions. To encourage early alignment, we introduce a collaborative guidance mechanism in which spatial and temporal attention maps from each branch are used to refine the other branch's input latents, ensuring both visual and temporal coherence.
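As a rough illustration of the sampling side of (b), the sketch below combines noise predictions from a subject branch (video model with the fused subject LoRA) and a motion branch (video model with the motion LoRA) at each denoising step. The equal-weight combination and the diffusers-style scheduler/UNet interfaces are assumptions, and the attention-map-based collaborative guidance is only indicated in comments.

import torch

@torch.no_grad()
def collaborative_sampling_step(scheduler, unet_subject, unet_motion, z_t, t, cond, w_s=0.5, w_m=0.5):
    """One denoising step over two LoRA-equipped branches (hypothetical weights and interfaces)."""
    # In the full SCS scheme, spatial/temporal attention maps from each branch would first
    # be used to refine the other branch's input latents (collaborative guidance); omitted here.
    eps_subject = unet_subject(z_t, t, encoder_hidden_states=cond).sample  # fused subject LoRA branch
    eps_motion = unet_motion(z_t, t, encoder_hidden_states=cond).sample    # motion LoRA branch
    eps = w_s * eps_subject + w_m * eps_motion                             # combine branch predictions
    return scheduler.step(eps, t, z_t).prev_sample                         # standard diffusion update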
@article{huang2025videomage,
title={VideoMage: Multi-Subject and Motion Customization of Text-to-Video Diffusion Models},
author={Huang, Chi-Pin and Wu, Yen-Siang and Chung, Hung-Kai and Chang, Kai-Po and Yang, Fu-En and Wang, Yu-Chiang Frank},
journal={arXiv preprint arXiv:2503.21781},
year={2025}
}