Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

DisMo: Disentangled Motion Representations for Open-World Motion Transfer

Authors: Thomas Ressler-Antal, Frank Fundel, Malek Ben Alaya, Stefan Andreas Baumann, Felix Krause, Ming Gui, Björn Ommer

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate the effectiveness of our method through a diverse set of motion transfer tasks. Finally, we show that the learned representations are well-suited for downstream motion understanding tasks, consistently outperforming state-of-the-art video representation models such as V-JEPA in zero-shot action classification on benchmarks including Something-Something v2 and Jester.
Researcher Affiliation | Academia | Thomas Ressler-Antal, Frank Fundel, Malek Ben Alaya, Stefan Andreas Baumann, Felix Krause, Ming Gui, Björn Ommer; CompVis @ LMU Munich, Munich Center for Machine Learning (MCML)
Pseudocode | No | The paper does not contain any sections explicitly labeled 'Pseudocode' or 'Algorithm', nor are there structured algorithmic steps presented in a code-like format.
Open Source Code | Yes | Project page: https://compvis.github.io/DisMo, We include anonymized code and detailed reproduction instructions in the supplementary material to ensure faithful replication of our main experimental results.
Open Datasets | Yes | We train ... on open-world videos from K-710 [Li et al., 2022], SSv2 [Goyal et al., 2017], Moments in Time [Monfort et al., 2019a] and OpenVid-1M [Nan et al., 2025]... we build upon prior evaluation strategies [Zhao et al., 2024, Jeong et al., 2024, Yatim et al., 2024, Park et al., 2024], assembling a diverse set of videos and text prompts primarily from the DAVIS dataset [Pont-Tuset et al., 2017].
Dataset Splits | Yes | We selected videos of 5 individuals and used only four actions ([eat, run, walk, jump]) for training, holding out the drink action for testing. We trained models on videos from four individuals (Andrea, Leyla, Gu, Georgios) and tested on videos from the fifth (Steve), ensuring no overlap in identity between train and test.
Hardware Specification | Yes | hardware: dtype bfloat16; accelerator GH200 96G. We primarily use Nvidia GH200 96GB modules in an internal cluster for pre-training and Nvidia H200 141GB in an internal cluster for evaluations.
Software Dependencies | No | The paper does not explicitly provide specific software dependencies with version numbers (e.g., Python, PyTorch, CUDA versions) used for the experiments.
Experiment Setup | Yes | Table A: Pretraining hyper-parameters for our model. Category, Hyper-parameter, Value: datasets: K-710/SSv2/MiT/OpenVid-1M; resolution: 256 × 256; num_frames: 8; fps: 6; ... batch_size: 32; total number of iterations: 530k; warmup iterations: 5000; lr scheduler: constant w/ warmup; lr: 10^-4; AdamW β: (0.9, 0.95); architecture: frame_embed_depth 12, frame_embed_dim 768, sequence_embed_depth 12, sequence_embed_dim 768, frame_generator_depth 28, frame_generator_dim 1152
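
For readers who want to carry the Table A values quoted in the Experiment Setup row into a script, the following is a minimal sketch and not the authors' code. The class name `PretrainConfig`, the function `lr_at`, and all field names are hypothetical. The quoted text says only "constant w/ warmup", so the linear ramp used here is an assumption about the warmup shape.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class PretrainConfig:
    """Hypothetical container; values transcribed from the quoted Table A."""
    resolution: int = 256                 # frames are 256 x 256
    num_frames: int = 8
    fps: int = 6
    batch_size: int = 32
    total_iterations: int = 530_000       # "530k" in the table
    warmup_iterations: int = 5_000
    lr: float = 1e-4
    adamw_betas: Tuple[float, float] = (0.9, 0.95)


def lr_at(step: int, cfg: PretrainConfig) -> float:
    """Learning rate at a zero-based step: warmup, then constant.

    Assumption: the warmup is a linear ramp; the source does not
    state its shape.
    """
    if step < cfg.warmup_iterations:
        return cfg.lr * (step + 1) / cfg.warmup_iterations
    return cfg.lr


cfg = PretrainConfig()
for step in (0, 2_499, 5_000, cfg.total_iterations - 1):
    print(step, lr_at(step, cfg))
```

Under this assumption the rate reaches its constant value of 10^-4 once the 5000 warmup iterations are complete and stays there for the rest of the 530k iterations.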