VIDM: Video Implicit Diffusion Models
Authors: Kangfu Mei, Vishal Patel
AAAI 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Various experiments are conducted on datasets consisting of videos with different resolutions and different numbers of frames. Results show that the proposed method outperforms the state-of-the-art generative adversarial network-based methods by a significant margin in terms of FVD scores as well as perceptible visual quality. The effectiveness of the proposed model is demonstrated on various datasets by comparing its performance with several state-of-the-art works. We present the main quantitative results comparison in Table 1 and Table 2, and the main qualitative results comparison in Figure 3. |
| Researcher Affiliation | Academia | Johns Hopkins University |
| Pseudocode | No | The paper provides mathematical formulations for its methods, such as equations for learning objectives and transformations. However, it does not include any clearly labeled 'Pseudocode' or 'Algorithm' blocks. |
| Open Source Code | No | The paper lists a project page URL (https://kfmei.page/vidm/) on the first page. However, it does not include an explicit statement confirming the release of source code at this URL, nor does the URL directly link to a source-code repository. |
| Open Datasets | Yes | The experiments are conducted on UCF-101 (Soomro, Zamir, and Shah 2012), Tai Chi-HD (Siarohin et al. 2019), Sky Time-lapse (Xiong et al. 2018), and CLEVRER (Yi et al. 2020). |
| Dataset Splits | No | The paper states, 'All evaluation is conducted on 2048 randomly selected real and generated videos for reducing variance.' This describes the data used for evaluation, but it does not specify the train/validation/test splits of the primary datasets (UCF-101, Tai Chi-HD, etc.) used during model training. |
| Hardware Specification | No | The paper does not provide any specific details about the hardware used for running the experiments, such as CPU or GPU models, memory, or cloud computing specifications. |
| Software Dependencies | No | The paper mentions several software components and architectures such as 'U-Net', 'Multi-Head Attention', 'Group Norm', 'PixelCNN++', and 'SpyNet'. However, it does not provide specific version numbers for any of these software dependencies. |
| Experiment Setup | Yes | The diffusion network architecture of our method is an autoencoder network that follows the design of PixelCNN++ (Salimans et al. 2017). We apply multiple multi-head attention modules (Vaswani et al. 2017) at features in a resolution of 16×16 for capturing long-range dependence that benefits the perceptual quality. For the robustness penalty, 'η is a constant that is experimentally set as 1e-8'. (A hedged architectural sketch follows the table.) |
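The setup row above only names the building blocks (Group Norm, convolutions, multi-head attention applied at 16×16 features), not their exact arrangement. The sketch below is a minimal PyTorch illustration of that kind of block, not the authors' released code: the channel width, head count, activation, and residual structure are assumptions for the example.

```python
# Minimal sketch (PyTorch) of a denoising-network block in the style described
# in the paper's setup: Group Norm + convolutions, with multi-head self-attention
# applied only when the feature map is at the 16x16 resolution.
# Channel sizes, head counts, and the SiLU activation are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ResAttnBlock(nn.Module):
    def __init__(self, channels: int, num_heads: int = 4, attn_resolution: int = 16):
        super().__init__()
        self.attn_resolution = attn_resolution
        self.norm1 = nn.GroupNorm(32, channels)
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.norm2 = nn.GroupNorm(32, channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.attn_norm = nn.GroupNorm(32, channels)
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Residual convolutional path.
        h = self.conv1(F.silu(self.norm1(x)))
        h = self.conv2(F.silu(self.norm2(h)))
        x = x + h
        # Self-attention is applied only at the 16x16 feature resolution,
        # matching where the paper says attention modules are inserted.
        b, c, hh, ww = x.shape
        if hh == self.attn_resolution and ww == self.attn_resolution:
            tokens = self.attn_norm(x).flatten(2).transpose(1, 2)  # (B, H*W, C)
            attn_out, _ = self.attn(tokens, tokens, tokens)
            x = x + attn_out.transpose(1, 2).reshape(b, c, hh, ww)
        return x


if __name__ == "__main__":
    block = ResAttnBlock(channels=256)
    feats = torch.randn(2, 256, 16, 16)   # attention is applied at this resolution
    print(block(feats).shape)             # torch.Size([2, 256, 16, 16])
```

This only mirrors the attention placement reported in the setup row; how many such blocks the authors stack, and how they condition on the diffusion timestep and implicit video coordinates, is not specified in the excerpt and is not assumed here.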