Compositional Image Decomposition with Diffusion Models
Authors: Jocelin Su, Nan Liu, Yanbo Wang, Joshua B. Tenenbaum, Yilun Du
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | From Section 4 (Experiments): In this section, we evaluate the ability of our approach to decompose images. First, we assess decomposition of images into global factors of variation in Section 4.2. We next evaluate decomposition of images into local factors of variation in Section 4.3. We further investigate the ability of decomposed components to recombine across separate trained models in Section 4.4. Finally, we illustrate how our approach can be adapted to pretrained models in Section 4.5. For quantitative evaluation of image quality, we employ Fréchet Inception Distance (FID) (Heusel et al., 2017), Kernel Inception Distance (KID) (Bińkowski et al., 2018), and LPIPS (Zhang et al., 2018) on images reconstructed from CelebA-HQ (Karras et al., 2017), Falcor3D (Nie et al., 2020), Virtual KITTI 2 (Cabon et al., 2020), and CLEVR (Johnson et al., 2017). |
| Researcher Affiliation | Academia | Jocelin Su (MIT), Nan Liu (UIUC), Yanbo Wang (TU Delft), Joshua B. Tenenbaum (MIT), Yilun Du (MIT). |
| Pseudocode | Yes | Algorithm 1 (Training Algorithm) and Algorithm 2 (Image Generation Algorithm); hedged sketches of both follow the table. |
| Open Source Code | Yes | Code and visualizations are at https://energy-based-model.github.io/decomp-diffusion. |
| Open Datasets | Yes | For quantitative evaluation of image quality, we employ Fréchet Inception Distance (FID) (Heusel et al., 2017), Kernel Inception Distance (KID) (Bińkowski et al., 2018), and LPIPS (Zhang et al., 2018) on images reconstructed from CelebA-HQ (Karras et al., 2017), Falcor3D (Nie et al., 2020), Virtual KITTI 2 (Cabon et al., 2020), and CLEVR (Johnson et al., 2017). An illustrative metric-evaluation sketch follows the table. |
| Dataset Splits | No | The paper mentions total dataset sizes (e.g., 'CLEVR 10K', 'Celeb A-HQ 30K') but does not specify explicit training, validation, or test splits (e.g., percentages or sample counts) for reproducibility. |
| Hardware Specification | Yes | Each model is trained for 24 hours on an NVIDIA V100 32GB machine or an NVIDIA GeForce RTX 2080 24GB machine. |
| Software Dependencies | No | The paper mentions a 'standard U-Net architecture' and references codebases for baselines, but it does not specify version numbers for general software dependencies such as Python, PyTorch, or CUDA for its own implementation. |
| Experiment Setup | Yes | We used standard denoising training to train our denoising networks, with 1000 diffusion steps and a squared cosine beta schedule. In our implementation, the denoising network ϵθ is trained to directly predict the original image x0, since we show this leads to better performance due to the similarity between our training objective and autoencoder training. [...] We use a batch size of 32 when training. (See the training-step sketch below.) |
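
For concreteness, here is a minimal PyTorch sketch of the training step the paper describes (Algorithm 1): 1000 diffusion steps, a squared cosine beta schedule, and a network regressed directly onto x0. The names `encoder` and `denoiser`, and the summation of per-component predictions, are assumed interfaces consistent with the paper's compositional formulation, not its exact code.

```python
import math
import torch
import torch.nn.functional as F

T = 1000  # diffusion steps, as stated in the paper

def cosine_alpha_bar(T: int, s: float = 0.008) -> torch.Tensor:
    """Squared-cosine noise schedule (Nichol & Dhariwal, 2021):
    alpha_bar_t ∝ cos^2(((t/T + s) / (1 + s)) * pi / 2),
    normalized so alpha_bar_0 = 1."""
    t = torch.linspace(0, T, T + 1) / T
    f = torch.cos((t + s) / (1 + s) * math.pi / 2) ** 2
    return (f / f[0])[1:].clamp(1e-5, 1.0)  # alpha_bar for t = 1..T

alpha_bar = cosine_alpha_bar(T)

def training_step(x0, encoder, denoiser, optimizer):
    """One denoising step in which the network predicts x0 directly.
    `encoder` (image -> K component latents) and `denoiser(x_t, t, z)`
    (a latent-conditioned U-Net) are hypothetical interfaces."""
    b = x0.shape[0]
    t = torch.randint(0, T, (b,), device=x0.device)
    a = alpha_bar.to(x0.device)[t].view(b, 1, 1, 1)
    noise = torch.randn_like(x0)
    x_t = a.sqrt() * x0 + (1 - a).sqrt() * noise  # forward process q(x_t | x0)

    z_components = encoder(x0)  # K latents inferred from the clean image
    x0_pred = sum(denoiser(x_t, t, z) for z in z_components)

    loss = F.mse_loss(x0_pred, x0)  # regress x0 itself, not the noise
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```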
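Algorithm 2 (Image Generation) can then be approximated by running the reverse process with a composed set of component latents. The sketch below reuses `T`, `alpha_bar`, and the hypothetical `denoiser` from the training sketch and substitutes a deterministic DDIM-style update for brevity; the paper's actual sampler may differ in these details.

```python
@torch.no_grad()
def generate(denoiser, z_components, shape, device="cpu"):
    """Compose an arbitrary set of component latents (possibly taken
    from different images or, as in Section 4.4, from separately
    trained models) into a single image."""
    x_t = torch.randn(shape, device=device)
    ab = alpha_bar.to(device)
    for t in reversed(range(T)):
        t_b = torch.full((shape[0],), t, device=device, dtype=torch.long)
        # Composed prediction: sum of per-component x0 estimates.
        x0_pred = sum(denoiser(x_t, t_b, z) for z in z_components)
        x0_pred = x0_pred.clamp(-1.0, 1.0)
        # Recover the implied noise, then step to t-1 deterministically.
        eps = (x_t - ab[t].sqrt() * x0_pred) / (1.0 - ab[t]).sqrt()
        ab_prev = ab[t - 1] if t > 0 else torch.tensor(1.0, device=device)
        x_t = ab_prev.sqrt() * x0_pred + (1.0 - ab_prev).sqrt() * eps
    return x_t
```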
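The reported FID, KID, and LPIPS numbers could be reproduced along the following lines with `torchmetrics` (install with the `torchmetrics[image]` extras). The paper does not state which metric implementation it used, so treat this as an illustrative setup rather than the authors' evaluation code; the placeholder tensors stand in for real images and their reconstructions.

```python
import torch
from torchmetrics.image import (FrechetInceptionDistance,
                                KernelInceptionDistance,
                                LearnedPerceptualImagePatchSimilarity)

# Placeholder batches standing in for dataset images and their
# reconstructions: float values in [0, 1], shape [N, 3, H, W].
real = torch.rand(64, 3, 128, 128)
recon = (real + 0.05 * torch.randn_like(real)).clamp(0, 1)

fid = FrechetInceptionDistance(feature=2048, normalize=True)
kid = KernelInceptionDistance(subset_size=50, normalize=True)  # subset_size <= N
lpips = LearnedPerceptualImagePatchSimilarity(net_type="alex", normalize=True)

fid.update(real, real=True); fid.update(recon, real=False)
kid.update(real, real=True); kid.update(recon, real=False)

print("FID:", fid.compute().item())
kid_mean, kid_std = kid.compute()  # KID reports a mean and std over subsets
print("KID:", kid_mean.item())
print("LPIPS:", lpips(recon, real).item())
```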