Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

MimicMotion: High-Quality Human Motion Video Generation with Confidence-aware Pose Guidance

Authors: Yuang Zhang, Jiaxi Gu, Li-Wen Wang, Han Wang, Junqi Cheng, Yuefeng Zhu, Fangyuan Zou

ICML 2025 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental Experiments demonstrate the effectiveness of our approach in producing high-quality human motion videos. Videos and comparisons are available at https://tencent.github.io/MimicMotion.
Researcher Affiliation Collaboration 1Tencent 2Shanghai Jiao Tong University. Correspondence to: Jiaxi Gu <EMAIL>.
Pseudocode Yes Algorithm 1 Progressive latent fusion for long videos.
Open Source Code No The abstract states: "Videos and comparisons are available at https://tencent.github.io/MimicMotion." This link points to a project demonstration page, not a specific code repository for the methodology described in the paper. No other explicit statement about code release or repository link for the authors' own method was found.
Open Datasets Yes We evaluate performance on test sequences from the TikTok dataset (Jafarian & Park, 2021). Jafarian, Y. and Park, H. S. Learning high fidelity depths of dressed humans by watching social media dance videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 12753–12762, June 2021.
Dataset Splits Yes Following the testing protocol of previous works (Wang et al., 2023; Chang et al., 2023), we adopt the TikTok (Jafarian & Park, 2021) dataset and use sequences 335 to 340 for our evaluation.
Hardware Specification Yes We train our model on 8 NVIDIA A100 GPUs for 20 epochs, with a batch size of 8 and 16 frames per clip.
Software Dependencies No No specific software dependencies with version numbers (e.g., Python version, library versions) are mentioned in the paper.
Experiment Setup Yes We train our model on 8 NVIDIA A100 GPUs for 20 epochs, with a batch size of 8 and 16 frames per clip. The loss weight of the hand region is 10. The learning rate is 10^-5 with a linear warmup of 500 iterations. We tune all parameters in the UNet and PoseNet. We follow Stable Video Diffusion and adopt the noise distribution log σ ∼ N(P_mean, P_std²) proposed by Karras et al. (2022), with parameters P_mean = 0.5 and P_std = 1.4.
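The noise distribution quoted above (log σ ∼ N(P_mean, P_std²), from Karras et al., 2022) can be sketched in a few lines. This is a minimal illustration, not the authors' code; the function name `sample_sigma` is invented here, and the default parameters are the values quoted in the setup.

```python
import numpy as np

def sample_sigma(batch_size, p_mean=0.5, p_std=1.4, seed=0):
    """Sample per-example diffusion noise levels sigma such that
    log(sigma) ~ N(p_mean, p_std^2), as in the EDM formulation
    of Karras et al. (2022)."""
    rng = np.random.default_rng(seed)
    log_sigma = rng.normal(loc=p_mean, scale=p_std, size=batch_size)
    return np.exp(log_sigma)

# One noise level per training example, matching the batch size of 8.
sigmas = sample_sigma(8)
```

Each sampled sigma is strictly positive, and larger P_mean shifts training toward higher noise levels.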