The development of Text-to-Video (T2V) generation has made motion transfer possible, enabling video motion to be controlled using existing footage as a reference. However, current methods have two limitations: 1) they struggle with multi-subject videos and fail to transfer the motion of a specific subject; 2) they struggle to preserve the diversity and accuracy of motion when transferring it to subjects with different shapes. To overcome these limitations, we introduce ConMo, a training-free framework that disentangles and recomposes the motions of subjects and camera movements. ConMo isolates individual subject and background motion cues from the complex trajectories in a source video using only subject masks, and reassembles them during target video generation. This enables more accurate motion control across diverse subjects and improves performance in multi-subject scenarios. In addition, we propose soft motion guidance in the recomposition stage, which controls how much of the original motion is retained and thereby adjusts the shape constraint, aiding subject shape adaptation and semantic transformation. Unlike previous methods, ConMo unlocks a wide range of applications, including subject size and position editing, subject removal, semantic modification, and camera motion simulation. Extensive experiments demonstrate that ConMo significantly outperforms state-of-the-art methods in motion fidelity and temporal consistency.
The method consists of two stages. (a) Motion disentanglement from the reference video: we first obtain a mask for each subject using SAM2, together with the video latent features acquired during DDIM inversion. Guided by these masks, we locate each subject's motion region in every frame and disentangle its motion by differencing the local spatial marginal means of the latent features within those regions (see the first sketch below). (b) Motion recomposition for target video generation: the extracted motion is injected into the initial noise through a motion guidance function combined with a soft guidance strategy (see the second sketch below), producing target videos with consistent motion and adaptive shape handling. The framework supports a range of video editing effects, including semantic changes, object removal, position editing, and camera motion simulation.
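To make stage (a) concrete, here is a minimal PyTorch sketch of the marginal-mean differencing described above. It is an illustration under our own assumptions, not the released ConMo code: the tensor shapes, the helper name `disentangle_subject_motion`, and the assumption that the SAM2 masks have already been resized to the latent resolution are ours.

```python
import torch

def disentangle_subject_motion(latents, masks):
    """
    Extract a per-subject motion cue from inverted video latents.

    Sketch only, not the official implementation. Assumes:
      latents: (F, C, H, W) latent features from DDIM inversion
      masks:   (F, H, W) binary mask of one subject (e.g. from SAM2),
               already downsampled to the latent resolution.
    Returns frame-to-frame differences of the masked latents'
    spatial marginal means, used here as the subject's motion signal.
    """
    m = masks.unsqueeze(1).float()               # (F, 1, H, W)
    masked = latents * m
    denom_h = m.sum(dim=3).clamp(min=1e-6)       # (F, 1, H) mask area per row
    denom_w = m.sum(dim=2).clamp(min=1e-6)       # (F, 1, W) mask area per column
    # Local spatial marginal means: average over one spatial axis,
    # restricted to the subject's masked region.
    row_mean = masked.sum(dim=3) / denom_h       # (F, C, H)
    col_mean = masked.sum(dim=2) / denom_w       # (F, C, W)
    # Differencing the marginals across consecutive frames
    # isolates how the subject moves, independent of other subjects.
    motion_h = row_mean[1:] - row_mean[:-1]      # (F-1, C, H)
    motion_w = col_mean[1:] - col_mean[:-1]      # (F-1, C, W)
    return motion_h, motion_w
```

Running this once per subject mask (and once on the background region) yields the separated motion cues that stage (b) recomposes.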
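For stage (b), the following sketch shows one gradient-guided denoising step of the kind the motion guidance function implies; the function name, the `retention` parameter, and the MSE objective are our assumptions for illustration, with `retention` playing the role of the soft guidance weight that trades original-motion fidelity against shape freedom.

```python
import torch

def soft_motion_guidance_step(latents, step_fn, source_motion, extract_motion,
                              guidance_scale=2.0, retention=0.5):
    """
    One guided denoising step, sketched under our own assumptions.

    latents:        current noisy video latents
    step_fn:        differentiable callable wrapping the frozen T2V denoiser
    source_motion:  motion cue disentangled from the reference video
    extract_motion: maps denoised latents to the same motion representation
    retention:      soft-guidance weight in [0, 1]; higher values keep more
                    of the original motion (tighter shape constraint), lower
                    values give the new subject room to change shape.
    """
    latents = latents.detach().requires_grad_(True)
    pred = step_fn(latents)                      # denoiser output for this step
    # Motion guidance: penalize deviation of the target's motion cue
    # from the reference motion, softened by the retention weight.
    loss = retention * torch.nn.functional.mse_loss(
        extract_motion(pred), source_motion)
    grad = torch.autograd.grad(loss, latents)[0]
    # Nudge the latents toward the reference motion before the next step.
    return (latents - guidance_scale * grad).detach()
```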
Qualitative comparisons (each example shows the Original video, DMT, and ConMo (ours)):
- Original prompt: "Two race cars are driving down a street" → Target prompt: "Two SUVs are driving down a street"
- Original prompt: "Two people are running down a street" → Target prompt: "Two robots are running down a street"
- Original prompt: "A woman is walking in the park" → Target prompt: "A boy is walking in the park"
- Original prompt: "A mallard is running on the grass" → Target prompt: "A swan is running on the grass"
- Original prompt: "A plane is flying in the sky" → Target prompt: "An air balloon is flying in the sky"
- Original prompt: "A locomotive is moving through the jungle" → Target prompt: "A man riding a bike is moving through the jungle"

Additional applications (Original video and ConMo (ours)):
- Original prompt: "A car is turning" → Target prompt: "City at night from overlook view"
- Original prompt: "Two boys are driving go-karts" → Target prompt: "A teddy bear is driving a go-kart"
@article{gao2025conmo,
title={ConMo: Controllable Motion Disentanglement and Recomposition for Zero-Shot Motion Transfer},
author={Gao, Jiayi and Yin, Zijin and Hua, Changcheng and Peng, Yuxin and Liang, Kongming and Ma, Zhanyu and Guo, Jun and Liu, Yang},
journal={arXiv preprint arXiv:2504.02451},
year={2025}
}