Neural Residual Diffusion Models for Deep Scalable Vision Generation

Authors: Zhiyuan Ma, Liangliang Zhao, Biqing Qi, Bowen Zhou

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results on various generative tasks show that the proposed neural residual models obtain state-of-the-art scores on image and video generative benchmarks. Rigorous theoretical proofs and extensive experiments also demonstrate the advantages of this simple gated residual mechanism, consistent with dynamic modeling, in improving the fidelity and consistency of generated content and supporting large-scale scalable training.
Researcher Affiliation | Collaboration | Zhiyuan Ma (1), Liangliang Zhao (1,2), Biqing Qi (3), Bowen Zhou (1,3); (1) Department of Electronic Engineering, Tsinghua University, Beijing, China; (2) Frontis.AI, Beijing, China; (3) Shanghai AI Laboratory, Shanghai, China
Pseudocode | No | The paper does not include any figures, blocks, or sections explicitly labeled 'Pseudocode' or 'Algorithm,' nor does it present structured steps formatted like code or an algorithm.
Open Source Code | Yes | Code is available at https://github.com/ponyzym/Neural-RDM.
Open Datasets | Yes | For image synthesis tasks, we train and evaluate the Class-to-Image generation models on the ImageNet [61] dataset and train and evaluate the Text-to-Image generation models on the MSCOCO [65] and JourneyDB [53] datasets... For video generation tasks, we follow the previous works [12, 60] to train None-to-Video (i.e., unconditional video generation) models on the SkyTimelapse [62] and Taichi [63] datasets, and train Class-to-Video models on the UCF-101 [64] dataset.
Dataset Splits | No | The paper mentions training and evaluation on datasets but does not explicitly specify a distinct validation dataset split (e.g., with percentages or sample counts) for hyperparameter tuning or early stopping. It states, 'Eventually, we utilize the AdamW optimizer with a constant learning rate of 5 × 10^-4 for all models and exploit an exponential moving average (EMA) strategy to obtain and report all results,' which describes training procedures but not a validation split.
Hardware Specification | No | The paper's NeurIPS checklist justification for 'Experiments Compute Resources' states, 'We have reported the GPU power used in the experiments in Sec. 3.1.' However, Section 3.1, which details the experimental settings, does not contain any specific hardware information, such as GPU models (e.g., NVIDIA A100, Tesla V100), CPU types, or memory specifications.
Software Dependencies | No | The paper mentions using 'LDM [30] and Latte [60]' as base models and the 'AdamW optimizer.' However, it does not provide specific version numbers for any software dependencies, such as Python, PyTorch, TensorFlow, or CUDA versions, which are necessary to reproduce the software environment.
Experiment Setup | Yes | Implementation details. We implement our Neural-RDMs in two versions, Neural-RDM-U (U-shaped) and Neural-RDM-F (Flow-shaped), on top of the current state-of-the-art diffusion models LDM [30] and Latte [60] for image generation, and further employ the Neural-RDM-F version for video generation. Specifically, we first load the corresponding pre-trained models and initialize the gating parameters {α = 1, β = 0} of each layer, then perform full-parameter fine-tuning to implicitly learn the data distribution, acting as a parameterized mean-variance scheduler. During training, we adopt an explicit supervision strategy to enhance the sensitivity-correction capabilities of α and β for deep scalable training, where the explicitly supervised hyper-parameter γ is set to 0.35. Eventually, we utilize the AdamW optimizer with a constant learning rate of 5 × 10^-4 for all models and exploit an exponential moving average (EMA) strategy to obtain and report all results.
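The gating initialisation and optimizer settings quoted in the experiment-setup row above can be illustrated with a short sketch. The following PyTorch snippet is a minimal, hypothetical example, not the authors' released code: the GatedResidualBlock class, the toy nn.Linear sub-block, and the placeholder reconstruction loss are assumptions made for demonstration; only the (α = 1, β = 0) gate initialisation, the AdamW optimizer with a constant learning rate of 5 × 10^-4, and the EMA of weights come from the stated setup.

```python
import torch
import torch.nn as nn

class GatedResidualBlock(nn.Module):
    """Sketch of a learnable gated residual connection (illustrative only).
    The per-layer gates rescale the skip path (alpha) and the residual branch
    (beta); starting from (alpha=1, beta=0) makes the block an identity
    mapping at the start of fine-tuning, matching the initialisation above."""

    def __init__(self, inner: nn.Module):
        super().__init__()
        self.inner = inner                         # stand-in for an attention/conv sub-block
        self.alpha = nn.Parameter(torch.ones(1))   # gate on the skip path, initialised to 1
        self.beta = nn.Parameter(torch.zeros(1))   # gate on the residual branch, initialised to 0

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.alpha * x + self.beta * self.inner(x)


# Toy fine-tuning step: all parameters are trained with AdamW at a constant
# learning rate of 5e-4, and an EMA copy of the weights is kept for reporting.
block = GatedResidualBlock(nn.Linear(64, 64))
optimizer = torch.optim.AdamW(block.parameters(), lr=5e-4)
ema = torch.optim.swa_utils.AveragedModel(
    block, avg_fn=lambda e, c, n: 0.9999 * e + (1 - 0.9999) * c)

x = torch.randn(8, 64)
loss = (block(x) - x).pow(2).mean()  # placeholder objective, not the diffusion loss
loss.backward()
optimizer.step()
ema.update_parameters(block)
```

Because the gates start at the identity mapping, stacking such blocks leaves the pre-trained backbone's behaviour unchanged at initialisation, which is what allows full-parameter fine-tuning to scale the residual contributions gradually.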