Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Wan-Move: Motion-controllable Video Generation via Latent Trajectory Guidance
Authors: Ruihang Chu, Yefei He, Zhekai Chen, Shiwei Zhang, Xiaogang Xu, Bin Xia, Dingdong Wang, Hongwei Yi, Xihui Liu, Hengshuang Zhao, Yu Liu, Yingya Zhang, Yujiu Yang
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on MoveBench and public datasets show that Wan-Move supports diverse motion-control tasks and delivers commercial-grade results with scaled training. Section 5, titled 'Experiment', details the experimental setup, main results, and ablation studies, including quantitative comparisons using metrics like FID, FVD, PSNR, SSIM, and EPE, and a human study. |
| Researcher Affiliation | Collaboration | 1Tongyi Lab, Alibaba Group 2Tsinghua University 3HKU 4CUHK |
| Pseudocode | No | The paper describes its methodology and model architecture primarily through descriptive text, equations (Eq. 1-5), and diagrams (Figure 2, Figure 3). It does not contain any clearly labeled pseudocode blocks or algorithm listings. |
| Open Source Code | Yes | Github: https://github.com/ali-vilab/Wan-Move. Code, models, and benchmark data are made available. |
| Open Datasets | Yes | To support comprehensive evaluation, we further design MoveBench, a rigorously curated benchmark featuring diverse content categories and hybrid-verified annotations. It is distinguished by larger data volume, longer video durations, and high-quality motion annotations. Extensive experiments on MoveBench and the public dataset consistently show Wan-Move's superior motion quality. Code, models, and benchmark data are made available. Training data. We curate a high-quality training dataset, which undergoes rigorous two-stage filtering to ensure both visual quality and motion consistency. First, we manually annotate the visual quality of 1,000 samples and use them to train an expert scoring model for initial quality assessment. To further enhance temporal coherence, we introduce a motion quality filtering stage. Table 10 presents the composition of the filtered training datasets, which are sourced from Panda70M [77], Pixabay [78], Pexels [70], and YouTube. YouTube videos are independently collected for this study. To prevent data leakage, the videos from Pexels are strictly separated from those in the proposed MoveBench. Quantitative results on MoveBench and the public DAVIS [21] are shown in Table 1. |
| Dataset Splits | No | The paper mentions that it uses a final dataset of 2 million high-quality 720p videos for training, and introduces MoveBench as a benchmark with 1018 videos for evaluation. It also states that for training iterations, 'we sample k trajectories from a mixed distribution: with 5% probability, no trajectory is used (k = 0); with 95% probability, k is uniformly sampled from 1 to 200.' However, it does not explicitly detail specific training/validation/test splits (e.g., percentages or counts) for its main 2 million video training dataset or how MoveBench relates to these splits beyond being an evaluation benchmark. |
| Hardware Specification | Yes | We train our model using 64 NVIDIA A100 GPUs, with each GPU processing a quarter of the sequence length, for a total of 30,000 steps. |
| Software Dependencies | No | During training, both the DiT and umT5 components of Wan are wrapped with Fully Sharded Data Parallel (FSDP) [80], with parameters cast to torch.bfloat16 for memory efficiency. The training employs the AdamW optimizer [81] with a weight decay of 1e-3 and a base learning rate of 5e-6. The first 2,000 steps are used for linear warm-up to enable a smooth transition from the initial I2V generation (corresponding to 0 point trajectories) to motion-controllable video generation. We adopt flow matching objective for optimization, where the number of time sampling steps is set to 1,000 during training. While specific libraries like PyTorch (implied by FSDP) and AdamW are mentioned, explicit version numbers for these software components are not provided in the paper. |
| Experiment Setup | Yes | During training, both the DiT and umT5 components of Wan are wrapped with Fully Sharded Data Parallel (FSDP) [80], with parameters cast to torch.bfloat16 for memory efficiency. The training employs the AdamW optimizer [81] with a weight decay of 1e-3 and a base learning rate of 5e-6. The first 2,000 steps are used for linear warm-up to enable a smooth transition from the initial I2V generation (corresponding to 0 point trajectories) to motion-controllable video generation. We adopt flow matching objective for optimization, where the number of time sampling steps is set to 1,000 during training. To enable large-scale training with long sequences (e.g., 5s video clip), we adopt the Ulysses sequence parallelism strategy [82] following Wan, setting the sequence parallel size to 4. We train our model using 64 NVIDIA A100 GPUs, with each GPU processing a quarter of the sequence length, for a total of 30,000 steps. During inference, we use a classifier-free guidance scale w of 5.0 unless otherwise specified. |
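
The numeric settings quoted in the Dataset Splits, Hardware Specification, and Experiment Setup rows can be made concrete with a minimal sketch. It is not the authors' training code. The function names are hypothetical, the constant learning rate after warm-up is an assumption (the quoted text only specifies the 2,000-step linear warm-up), and the data-parallel count is simple arithmetic under the usual reading that each group of 4 sequence-parallel GPUs shares one sample.

```python
import random

# Values quoted in the table above.
BASE_LR = 5e-6           # base learning rate
WARMUP_STEPS = 2_000     # linear warm-up length
TOTAL_STEPS = 30_000     # total training steps
NUM_GPUS = 64            # NVIDIA A100 GPUs
SEQ_PARALLEL_SIZE = 4    # Ulysses sequence parallel size
MAX_TRAJECTORIES = 200   # upper bound on k
P_NO_TRAJECTORY = 0.05   # probability that k = 0


def warmup_lr(step: int) -> float:
    """Learning rate at a 0-indexed step: linear warm-up, then constant.

    The constant phase after warm-up is an assumption; the source does not
    describe the schedule beyond the warm-up.
    """
    if step < WARMUP_STEPS:
        return BASE_LR * (step + 1) / WARMUP_STEPS
    return BASE_LR


def sample_num_trajectories(rng: random.Random) -> int:
    """Draw k: 0 with 5% probability, else uniform over 1..200 inclusive."""
    if rng.random() < P_NO_TRAJECTORY:
        return 0
    return rng.randint(1, MAX_TRAJECTORIES)


def data_parallel_groups() -> int:
    """Number of GPU groups that each hold one full sequence."""
    return NUM_GPUS // SEQ_PARALLEL_SIZE


if __name__ == "__main__":
    rng = random.Random(0)
    print("lr at step 0:", warmup_lr(0))
    print("lr at final step:", warmup_lr(TOTAL_STEPS - 1))
    print("example k values:", [sample_num_trajectories(rng) for _ in range(5)])
    print("data-parallel groups:", data_parallel_groups())
```

With a sequence parallel size of 4, the 64 GPUs form 16 groups, which matches the statement that each GPU processes a quarter of the sequence length.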