Decouple Content and Motion for Conditional Image-to-Video Generation
Authors: Cuifeng Shen, Yulu Gan, Chen Chen, Xiongwei Zhu, Lele Cheng, Tingting Gao, Jinzhi Wang
AAAI 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on various datasets confirm our approach's superior performance over the majority of state-of-the-art methods in both effectiveness and efficiency. |
| Researcher Affiliation | Collaboration | Cuifeng Shen1, Yulu Gan1, Chen Chen2, Xiongwei Zhu3, Lele Cheng3, Tingting Gao3, Jinzhi Wang1* 1Peking University 2Chinese Academy of Science 3Kuaishou Technology |
| Pseudocode | No | The paper describes algorithms using mathematical equations and textual descriptions, but it does not include a clearly labeled 'Pseudocode' or 'Algorithm' block or figure. |
| Open Source Code | No | The paper does not contain an explicit statement about releasing source code or provide a link to a code repository for the methodology described. |
| Open Datasets | Yes | We conduct our experiment on well-known video datasets used for image-to-video generation: MHAD (Chen, Jafari, and Kehtarnavaz 2015), NATOPS (Yale Song and Davis 2011), and BAIR (Ebert et al. 2017). |
| Dataset Splits | Yes | For training and testing purposes, we've randomly picked 602 videos from all subjects for the training set and 259 videos for the testing set. (MHAD dataset) ... We have arbitrarily chosen 6720 videos from all subjects for the training phase, while the remaining videos are for the testing phase. (NATOPS dataset) |
| Hardware Specification | No | The paper mentions FLOPs and Memory (GB) consumption in Table 4, which are resource usages, but does not specify the exact hardware components (e.g., GPU models, CPU types) used for the experiments. |
| Software Dependencies | No | The paper mentions software components like '3D U-Net architecture', 'CLIP', and 'VAE', but it does not specify version numbers for any of these software dependencies. |
| Experiment Setup | Yes | We use a conditional 3D U-Net architecture as the denoising network, and directly apply the multi-head self-attention (Cheng, Dong, and Lapata 2016) mechanism to the 3D video signal. Additionally, we use a ResNet (He et al. 2016) block to encode the first frame as a conditional feature map and provide it to ϵθ by concatenation with the noise ϵ. ... For ED-VDM, ... we employ a VAE (Rombach et al. 2022) with slight KL-regularization of 1e-6 to encode the residual into a 16 × 16 × 16 latent representation. |
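
The experiment-setup quote describes first-frame conditioning by channel concatenation: a ResNet block encodes the conditioning frame into a feature map, which is concatenated with the noisy video input of the 3D denoising network ϵθ. The sketch below illustrates that wiring only; it is not the authors' released code, and all module names, channel sizes, and the toy `FirstFrameCondUNet3D` stand-in for the conditional 3D U-Net are illustrative assumptions.

```python
# Minimal sketch (assumed PyTorch) of first-frame conditioning via concatenation.
# The real model is a full conditional 3D U-Net; a small Conv3d stack stands in here.
import torch
import torch.nn as nn


class ResBlock2D(nn.Module):
    """Simple residual block used to encode the conditioning frame."""

    def __init__(self, channels: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.SiLU(),
            nn.Conv2d(channels, channels, 3, padding=1),
        )

    def forward(self, x):
        return x + self.body(x)


class FirstFrameCondUNet3D(nn.Module):
    """Toy 3D denoiser conditioned on the first frame via channel concatenation."""

    def __init__(self, in_channels: int = 3, cond_channels: int = 16, hidden: int = 32):
        super().__init__()
        # Encode the first frame into a conditional feature map (ResNet-style block).
        self.frame_encoder = nn.Sequential(
            nn.Conv2d(in_channels, cond_channels, 3, padding=1),
            ResBlock2D(cond_channels),
        )
        # Stand-in for the conditional 3D U-Net denoising network eps_theta.
        self.denoiser = nn.Sequential(
            nn.Conv3d(in_channels + cond_channels, hidden, 3, padding=1),
            nn.SiLU(),
            nn.Conv3d(hidden, in_channels, 3, padding=1),
        )

    def forward(self, noisy_video, first_frame):
        # noisy_video: (B, C, T, H, W); first_frame: (B, C, H, W)
        b, c, t, h, w = noisy_video.shape
        cond = self.frame_encoder(first_frame)              # (B, cond_C, H, W)
        cond = cond.unsqueeze(2).expand(-1, -1, t, -1, -1)  # broadcast over time
        x = torch.cat([noisy_video, cond], dim=1)           # concat on channel axis
        return self.denoiser(x)                             # predicted noise


if __name__ == "__main__":
    model = FirstFrameCondUNet3D()
    video = torch.randn(2, 3, 8, 64, 64)   # noisy video (noise added upstream)
    frame = torch.randn(2, 3, 64, 64)      # clean first frame used as the condition
    print(model(video, frame).shape)       # torch.Size([2, 3, 8, 64, 64])
```

The conditioning feature map is repeated along the time axis so every frame of the noisy video sees the same first-frame information; the quoted ED-VDM latent path (VAE with 1e-6 KL-regularization, 16 × 16 × 16 residual latent) is not reproduced here.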