Decouple Content and Motion for Conditional Image-to-Video Generation
Authors: Cuifeng Shen, Yulu Gan, Chen Chen, Xiongwei Zhu, Lele Cheng, Tingting Gao, Jinzhi Wang
AAAI 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on various datasets confirm our approach's superior performance over the majority of state-of-the-art methods in both effectiveness and efficiency. |
| Researcher Affiliation | Collaboration | Cuifeng Shen1, Yulu Gan1, Chen Chen2, Xiongwei Zhu3, Lele Cheng3, Tingting Gao3, Jinzhi Wang1* 1Peking University 2Chinese Academy of Science 3Kuaishou Technology |
| Pseudocode | No | The paper describes algorithms using mathematical equations and textual descriptions, but it does not include a clearly labeled 'Pseudocode' or 'Algorithm' block or figure. |
| Open Source Code | No | The paper does not contain an explicit statement about releasing source code or provide a link to a code repository for the methodology described. |
| Open Datasets | Yes | We conduct our experiment on well-known video datasets used for image-to-video generation: MHAD (Chen, Jafari, and Kehtarnavaz 2015), NATOPS (Yale Song and Davis 2011), and BAIR (Ebert et al. 2017). |
| Dataset Splits | Yes | For training and testing purposes, we've randomly picked 602 videos from all subjects for the training set and 259 videos for the testing set. (MHAD dataset) ... We have arbitrarily chosen 6720 videos from all subjects for the training phase, while the remaining videos are for the testing phase. (NATOPS dataset) |
| Hardware Specification | No | The paper mentions FLOPs and Memory (GB) consumption in Table 4, which are resource usages, but does not specify the exact hardware components (e.g., GPU models, CPU types) used for the experiments. |
| Software Dependencies | No | The paper mentions software components like '3D U-Net architecture', 'CLIP', and 'VAE', but it does not specify version numbers for any of these software dependencies. |
| Experiment Setup | Yes | We use a conditional 3D U-Net architecture as the denoising network, and directly apply the multi-head self-attention (Cheng, Dong, and Lapata 2016) mechanism to the 3D video signal. Additionally, we use a ResNet (He et al. 2016) block to encode the first frame as a conditional feature map and provide it to ϵθ by concatenation with the noise ϵ. ... For ED-VDM, ... we employ a VAE (Rombach et al. 2022) with slight KL-regularization of 1e-6 to encode the residual into a 16 × 16 × 16 latent representation. |
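
The experiment-setup quote describes first-frame conditioning by channel concatenation: a ResNet block encodes the conditioning frame into a feature map, which is concatenated with the noisy video input of the 3D denoising network ϵθ. The sketch below illustrates that wiring only; it is not the authors' released code, and all module names, channel sizes, and the toy `FirstFrameCondUNet3D` stand-in for the conditional 3D U-Net are illustrative assumptions.

```python
# Minimal sketch (assumed PyTorch) of first-frame conditioning via concatenation.
# The real model is a full conditional 3D U-Net; a small Conv3d stack stands in here.
import torch
import torch.nn as nn


class ResBlock2D(nn.Module):
    """Simple residual block used to encode the conditioning frame."""

    def __init__(self, channels: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.SiLU(),
            nn.Conv2d(channels, channels, 3, padding=1),
        )

    def forward(self, x):
        return x + self.body(x)


class FirstFrameCondUNet3D(nn.Module):
    """Toy 3D denoiser conditioned on the first frame via channel concatenation."""

    def __init__(self, in_channels: int = 3, cond_channels: int = 16, hidden: int = 32):
        super().__init__()
        # Encode the first frame into a conditional feature map (ResNet-style block).
        self.frame_encoder = nn.Sequential(
            nn.Conv2d(in_channels, cond_channels, 3, padding=1),
            ResBlock2D(cond_channels),
        )
        # Stand-in for the conditional 3D U-Net denoising network eps_theta.
        self.denoiser = nn.Sequential(
            nn.Conv3d(in_channels + cond_channels, hidden, 3, padding=1),
            nn.SiLU(),
            nn.Conv3d(hidden, in_channels, 3, padding=1),
        )

    def forward(self, noisy_video, first_frame):
        # noisy_video: (B, C, T, H, W); first_frame: (B, C, H, W)
        b, c, t, h, w = noisy_video.shape
        cond = self.frame_encoder(first_frame)              # (B, cond_C, H, W)
        cond = cond.unsqueeze(2).expand(-1, -1, t, -1, -1)  # broadcast over time
        x = torch.cat([noisy_video, cond], dim=1)           # concat on channel axis
        return self.denoiser(x)                             # predicted noise


if __name__ == "__main__":
    model = FirstFrameCondUNet3D()
    video = torch.randn(2, 3, 8, 64, 64)   # noisy video (noise added upstream)
    frame = torch.randn(2, 3, 64, 64)      # clean first frame used as the condition
    print(model(video, frame).shape)       # torch.Size([2, 3, 8, 64, 64])
```

The conditioning feature map is repeated along the time axis so every frame of the noisy video sees the same first-frame information; the quoted ED-VDM latent path (VAE with 1e-6 KL-regularization, 16 × 16 × 16 residual latent) is not reproduced here.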