How Diffusion Models Learn to Factorize and Compose

Authors: Qiyao Liang, Ziming Liu, Mitchell Ostrow, Ila Fiete

NeurIPS 2024

Reproducibility assessment. Each item below gives the reproducibility variable, the result, and the LLM response.
Research Type: Experimental
LLM response: "We performed extensive controlled experiments on conditional Denoising Diffusion Probabilistic Models (DDPMs) trained to generate various forms of 2D Gaussian bump images. We found that the models learn factorized but not fully continuous manifold representations for encoding continuous features of variation underlying the data. With such representations, models demonstrate superior feature compositionality but limited ability to interpolate over unseen values of a given feature. Our experimental results further demonstrate that by training with independent factors of variation, diffusion models can attain compositionality with few compositional examples, suggesting a more efficient way to train DDPMs."
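For context, here is a minimal sketch of the standard conditional DDPM noise-prediction objective that such experiments build on. This is not the authors' code: the UNet interface model(x_t, t, cond), the number of steps, and the beta schedule are illustrative assumptions.

    # Sketch of a conditional DDPM training step (illustrative, not the paper's code).
    # `model(x_t, t, cond)` is an assumed UNet interface that predicts the added noise.
    import torch
    import torch.nn.functional as F

    T = 1000                                        # number of diffusion steps (assumed)
    betas = torch.linspace(1e-4, 0.02, T)           # standard linear beta schedule
    alphas_bar = torch.cumprod(1.0 - betas, dim=0)  # cumulative product of (1 - beta_t)

    def ddpm_loss(model, x0, cond):
        """Noise-prediction MSE loss for images x0 conditioned on bump coordinates cond."""
        b = x0.shape[0]
        t = torch.randint(0, T, (b,), device=x0.device)
        noise = torch.randn_like(x0)
        a_bar = alphas_bar.to(x0.device)[t].view(b, 1, 1, 1)
        x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise  # forward (noising) process
        return F.mse_loss(model(x_t, t, cond), noise)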
Researcher Affiliation: Academia
LLM response: Qiyao Liang, Ziming Liu, Mitchell Ostrow, and Ila Fiete; Massachusetts Institute of Technology; {qiyao,zmliu,ostrow,fiete}@mit.edu
Pseudocode: No
LLM response: The paper provides schematic diagrams of the UNet architecture and describes procedures in text, but it does not include any explicitly labeled "Pseudocode" or "Algorithm" blocks.
Open Source Code: No
LLM response: Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [NA]. Justification: "We have performed the standard experiments on standard UNet architectures with all necessary details of data generation and model training provided in Appendix A."
Open Datasets: No
LLM response: The paper describes generating its own dataset: "We generate N × N pixel grayscale images" and "A single dataset of these images consists of the enumeration of all possible Gaussians tiling the whole N × N canvas". It does not provide any information or link to make this generated dataset publicly available.
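The described dataset is straightforward to regenerate. A hedged sketch, assuming one isotropic Gaussian bump per image and one image per integer (x, y) center tiling the canvas; the values of N and sigma below are illustrative, with the paper's exact settings given in its Appendix A.

    # Illustrative regeneration of the described Gaussian-bump dataset.
    # N and sigma are placeholder values, not the paper's settings.
    import numpy as np

    def gaussian_bump_dataset(N=32, sigma=1.0):
        """Return an array of shape (N*N, N, N): one grayscale image per bump center."""
        ii, jj = np.meshgrid(np.arange(N), np.arange(N), indexing="ij")
        images = []
        for cx in range(N):          # enumerate all possible bump centers
            for cy in range(N):
                img = np.exp(-((ii - cx) ** 2 + (jj - cy) ** 2) / (2 * sigma ** 2))
                images.append(img.astype(np.float32))
        return np.stack(images)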
Dataset Splits: No
LLM response: The paper mentions "training and validation loss" in Appendix D.2, indicating that a validation set was used. However, it does not specify split percentages or sample counts for this validation set; it states only that 1024 samples were used for evaluation on a grid.
Hardware Specification: Yes
LLM response: "We train the models on a quad-core Nvidia A100 GPU, and an average training session lasts around 6 hours (including intermediate sampling time)."
Software Dependencies: No
LLM response: The paper mentions using the "AdamW optimizer" and a "learning rate scheduler from PyTorch" but does not specify version numbers for Python, PyTorch, or any other software libraries used in the implementation.
Experiment Setup: Yes
LLM response: "The model is trained with the AdamW optimizer with built-in weight decay, and we employ a learning rate scheduler from PyTorch during training. We did not perform any hyperparameter tuning, although we would expect our observations to hold regardless. Each model is given an inversely proportional amount of training time as a function of the size of the dataset that it is trained on, such that there is a sufficient amount of training for the models to converge."
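A minimal sketch of this reported setup: AdamW with weight decay plus a PyTorch learning-rate scheduler. The paper does not name the scheduler or any hyperparameter values, so the choices below (CosineAnnealingLR, learning rate, weight decay, epoch count) are assumptions; model, dataloader, and ddpm_loss are taken to exist as in the sketches above.

    # Assumed training configuration; scheduler choice and hyperparameters are
    # illustrative, since the paper does not specify them.
    import torch

    num_epochs = 100  # placeholder; the paper scales training time inversely with dataset size
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-2)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=num_epochs)

    for epoch in range(num_epochs):
        for x0, cond in dataloader:           # yields (image, bump-coordinate) pairs
            optimizer.zero_grad()
            loss = ddpm_loss(model, x0, cond)
            loss.backward()
            optimizer.step()
        scheduler.step()                      # step the LR schedule once per epoch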