Compose and Conquer: Diffusion-Based 3D Depth Aware Composable Image Synthesis
Authors: Jonghyun Lee, Hansam Cho, Youngjoon Yoo, Seoung Bum Kim, Yonghyun Jeong
ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We quantitatively evaluate our model against other baseline models and gauge the fidelity of samples and robustness to multiple input conditions, and demonstrate that our model substantially outperforms other models on various metrics. |
| Researcher Affiliation | Collaboration | Jonghyun Lee¹,², Hansam Cho¹,², Youngjoon Yoo², Seoung Bum Kim¹, Yonghyun Jeong²; ¹Korea University, ²NAVER Cloud |
| Pseudocode | Yes | Algorithm 1 Soft guidance for a single training timestep t |
| Open Source Code | Yes | The source code and pretrained models can be found at https://github.com/tomtom1103/compose-and-conquer. |
| Open Datasets | Yes | Our synthetic image triplets I_f, I_b, M are generated from two distinct datasets: COCO-Stuff (Caesar et al., 2018) and Pick-a-Pic (Kirstain et al., 2023). |
| Dataset Splits | Yes | "We train our local fuser with the cloned E and C for 28 epochs, our global fuser for 24 epochs, and finetune the full model for 9 epochs, all with a batch size of 32 across 8 NVIDIA V100s." and "Table 1 reports the results evaluated on 5K images of the COCO-Stuff validation set." |
| Hardware Specification | Yes | We train our local fuser with the cloned E and C for 28 epochs, our global fuser for 24 epochs, and finetune the full model for 9 epochs, all with a batch size of 32 across 8 NVIDIA V100s. |
| Software Dependencies | No | The paper mentions software components like 'CLIP image encoder' and 'Stable Diffusion' variants, but does not provide specific version numbers for any software dependencies, such as PyTorch or other libraries. |
| Experiment Setup | Yes | During training, images are resized and center cropped to a resolution of 512 × 512. We train our local fuser with the cloned E and C for 28 epochs, our global fuser for 24 epochs, and finetune the full model for 9 epochs, all with a batch size of 32 across 8 NVIDIA V100s. During training, we set an independent dropout probability for each condition to ensure that our model learns to generalize various combinations. For our evaluation, we employ DDIM (Song et al., 2020) sampling with 50 steps, and a CFG (Ho & Salimans, 2021) scale of 7 to generate images of 768 × 768. |
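
The Experiment Setup row describes a standard Stable Diffusion inference recipe: DDIM sampling with 50 steps, a CFG scale of 7, and 768 × 768 outputs. As a point of reference only, the minimal sketch below shows what that sampling configuration looks like with the Hugging Face `diffusers` library; the library choice, the base checkpoint `stabilityai/stable-diffusion-2-1`, and the prompt are assumptions (the paper does not pin its software stack), and the sketch omits the paper's local/global fusers and soft guidance.

```python
# Minimal sketch of the reported evaluation sampling settings
# (DDIM, 50 steps, CFG scale 7, 768x768 output).
# Assumptions: diffusers library, SD 2.1 base checkpoint, placeholder prompt.
import torch
from diffusers import StableDiffusionPipeline, DDIMScheduler

model_id = "stabilityai/stable-diffusion-2-1"  # hypothetical base checkpoint
pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16)
pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)  # switch to DDIM sampling
pipe = pipe.to("cuda")

image = pipe(
    prompt="a photo of a corgi on a beach",  # placeholder prompt
    num_inference_steps=50,  # DDIM steps reported in the paper
    guidance_scale=7.0,      # CFG scale reported in the paper
    height=768,
    width=768,
).images[0]
image.save("sample.png")
```

The paper's released code at https://github.com/tomtom1103/compose-and-conquer remains the authoritative reference for the full conditioning pipeline.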