Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
MimicMotion: High-Quality Human Motion Video Generation with Confidence-aware Pose Guidance
Authors: Yuang Zhang, Jiaxi Gu, Li-Wen Wang, Han Wang, Junqi Cheng, Yuefeng Zhu, Fangyuan Zou
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments demonstrate the effectiveness of our approach in producing high-quality human motion videos. Videos and comparisons are available at https://tencent.github.io/MimicMotion. |
| Researcher Affiliation | Collaboration | 1Tencent 2Shanghai Jiao Tong University. Correspondence to: Jiaxi Gu <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 Progressive latent fusion for long videos. |
| Open Source Code | No | The abstract states: "Videos and comparisons are available at https://tencent.github.io/MimicMotion." This link points to a project demonstration page, not a specific code repository for the methodology described in the paper. No other explicit statement about code release or repository link for the authors' own method was found. |
| Open Datasets | Yes | We evaluate performance on test sequences from the Tik Tok dataset (Jafarian & Park, 2021). Jafarian, Y. and Park, H. S. Learning high fidelity depths of dressed humans by watching social media dance videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 12753–12762, June 2021. |
| Dataset Splits | Yes | For the testing protocol of previous works (Wang et al., 2023; Chang et al., 2023), we adopt the Tik Tok (Jafarian & Park, 2021) dataset and use sequence 335 to 340 for our evaluation. |
| Hardware Specification | Yes | We train our model on 8 NVIDIA A100 GPUs for 20 epochs, with a batch size of 8 and 16 frames per clip. |
| Software Dependencies | No | No specific software dependencies with version numbers (e.g., Python version, library versions) are mentioned in the paper. |
| Experiment Setup | Yes | We train our model on 8 NVIDIA A100 GPUs for 20 epochs, with a batch size of 8 and 16 frames per clip. The loss weight of the hand region is 10. The learning rate is 10^-5 with a linear warmup of 500 iterations. We tune all parameters in the UNet and Pose Net. We follow Stable Video Diffusion and adopt the noise distribution, i.e. log σ ∼ N(P_mean, P_std²), proposed by Karras et al. (Karras et al., 2022) with parameters P_mean = 0.5 and P_std = 1.4. |
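The noise distribution quoted in the Experiment Setup row (log σ ∼ N(P_mean, P_std²), following Karras et al., 2022, with P_mean = 0.5 and P_std = 1.4) can be sketched in a few lines. This is an illustrative reimplementation of that sampling rule only, not code from the paper; the function name is hypothetical.

```python
import math
import random
from typing import Optional

def sample_noise_level(p_mean: float = 0.5,
                       p_std: float = 1.4,
                       rng: Optional[random.Random] = None) -> float:
    """Draw one EDM-style noise level sigma with log sigma ~ N(p_mean, p_std^2).

    Defaults match the values the paper reports (P_mean = 0.5, P_std = 1.4).
    """
    rng = rng or random.Random()
    # Sample log sigma from a Gaussian, then exponentiate to get sigma > 0.
    return math.exp(rng.gauss(p_mean, p_std))
```

Averaging log σ over many draws recovers P_mean, which is a quick sanity check that the log-normal parameterization is implemented as stated.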