ConMo: Controllable Motion Disentanglement and Recomposition for Zero-Shot Motion Transfer

Jiayi Gao1,2†, Zijin Yin2, Changcheng Hua1, Yuxin Peng1, Kongming Liang2, Zhanyu Ma2, Jun Guo2, Yang Liu1*
1Wangxuan Institute of Computer Technology, Peking University; 2Beijing University of Posts and Telecommunications

Abstract

The development of Text-to-Video (T2V) generation has made motion transfer possible, enabling the control of video motion based on existing footage. However, current methods have two limitations: 1) they struggle to handle multi-subject videos and fail to transfer the motion of a specific subject; 2) they struggle to preserve the diversity and accuracy of motion when transferring to subjects with varying shapes. To overcome these limitations, we introduce ConMo, a training-free framework that disentangles and recomposes the motions of subjects and camera movements. ConMo isolates individual subject and background motion cues from the complex trajectories in source videos using only subject masks, and reassembles them during target video generation. This approach enables more accurate motion control across diverse subjects and improves performance in multi-subject scenarios. Additionally, we propose soft motion guidance in the recomposition stage, which controls how much of the original motion is retained in order to adjust shape constraints, aiding subject shape adaptation and semantic transformation. Unlike previous methods, ConMo unlocks a wide range of applications, including subject size and position editing, subject removal, semantic modification, and camera motion simulation. Extensive experiments demonstrate that ConMo significantly outperforms state-of-the-art methods in motion fidelity and temporal consistency.

Method Overview

The method consists of two stages: (a) Reference Video Motion Disentanglement Stage: We first acquire masks for each subject in the reference video using SAM2, together with video latent features obtained during DDIM inversion. Based on these masks, we identify the motion regions of each subject across the frames of the reference video. By computing the frame-to-frame differences of local spatial marginal means of the latent features in these regions, we disentangle each subject's motion. (b) Motion Recomposition for Target Video Generation Stage: The extracted motion is integrated into the initial noise via the Motion Guidance function and the Soft Guidance strategy. This allows the generation of target videos with consistent motion and adaptive shape handling. The method supports various video editing effects, such as semantic changes, object removal, position editing, and camera simulation.
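To make the two stages concrete, below is a minimal PyTorch-style sketch of the core computation, not the authors' released implementation. It assumes the per-frame DDIM-inversion features are available as a tensor of shape [F, C, H, W] and that SAM2 subject masks have been resized to the latent resolution; the function names, tensor layout, and the strength parameter are illustrative assumptions.

import torch


def masked_marginal_means(feat: torch.Tensor, mask: torch.Tensor):
    """Masked spatial marginal means of one frame's latent features.

    feat: [C, H, W] latent features of a single frame.
    mask: [H, W] binary subject mask at the latent resolution.
    Returns (row_mean [C, H], col_mean [C, W]): per-row and per-column
    masked means, a coarse positional summary of the subject.
    """
    m = mask.to(feat.dtype).unsqueeze(0)                        # [1, H, W]
    row = (feat * m).sum(dim=2) / m.sum(dim=2).clamp(min=1.0)   # [C, H]
    col = (feat * m).sum(dim=1) / m.sum(dim=1).clamp(min=1.0)   # [C, W]
    return row, col


def disentangle_subject_motion(latents: torch.Tensor, masks: torch.Tensor):
    """Per-subject motion cue: frame-to-frame differences of the masked
    marginal means (stage (a) of the overview, in simplified form).

    latents: [F, C, H, W] DDIM-inversion features of the reference video.
    masks:   [F, H, W]    SAM2 masks of one subject, one per frame.
    Returns (d_row [F-1, C, H], d_col [F-1, C, W]).
    """
    rows, cols = zip(*(masked_marginal_means(latents[t], masks[t])
                       for t in range(latents.shape[0])))
    rows, cols = torch.stack(rows), torch.stack(cols)           # [F, C, H], [F, C, W]
    return rows[1:] - rows[:-1], cols[1:] - cols[:-1]


def soft_motion_guidance(noise: torch.Tensor, d_row: torch.Tensor,
                         d_col: torch.Tensor, mask: torch.Tensor,
                         strength: float = 0.5) -> torch.Tensor:
    """Blend one frame's extracted motion cue back into the initial noise
    inside the subject's region (stage (b), simplified). `strength` plays
    the role of the soft-guidance knob controlling how much of the original
    motion is retained.

    noise: [C, H, W] initial noise for one target-video frame.
    d_row: [C, H], d_col: [C, W] motion deltas for this frame.
    mask:  [H, W] region where the subject should move in the target video.
    """
    m = mask.to(noise.dtype).unsqueeze(0)                       # [1, H, W]
    cue = d_row.unsqueeze(-1) + d_col.unsqueeze(1)              # [C, H, W]
    return noise + strength * cue * m

In this simplified view, lowering strength retains less of the source subject's motion profile, mirroring how soft motion guidance relaxes shape constraints when the target subject's shape differs drastically from the source.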

ConMo Results

1. Multi-Subject

Original Prompt: Two race cars are driving down a street

Target Prompt: Two SUVs are driving down a street

Original

DMT

ConMo(ours)

Original Prompt: Two people are running down a street

Target Prompt: Two robots are running down a street

Original

DMT

ConMo(ours)

2. Fine-Grained

Original Prompt: A woman is walking in the park

Target Prompt: A boy is walking in the park

Original

DMT

ConMo(ours)

Original Prompt: A mallard is running on the grass

Target Prompt: A swan is running on the grass

Original

DMT

ConMo(ours)

3. Drastic Shape Difference

Original Prompt: A plane is flying in the sky

Target Prompt: An air balloon is flying in the sky

Original

DMT

ConMo(ours)

Original Prompt: A locomotive is moving through the jungle

Target Prompt: A man riding bike is moving through the jungle

Original

DMT

ConMo(ours)

4. Other Applications: Camera Motion Transfer

Original Prompt: A car is turning

Target Prompt: City at night from overlook view

Original

ConMo(ours)

5. Other Applications: Single Motion Extraction and Transfer

Original Prompt: Two boys are driving go-karts

Target Prompt: A teddy bear is driving a go-kart

Original

ConMo(ours)

BibTeX

@article{gao2025conmo,
  title={ConMo: Controllable Motion Disentanglement and Recomposition for Zero-Shot Motion Transfer},
  author={Gao, Jiayi and Yin, Zijin and Hua, Changcheng and Peng, Yuxin and Liang, Kongming and Ma, Zhanyu and Guo, Jun and Liu, Yang},
  journal={arXiv preprint arXiv:2504.02451},
  year={2025}
}