Compose and Conquer: Diffusion-Based 3D Depth Aware Composable Image Synthesis

Authors: Jonghyun Lee, Hansam Cho, Youngjoon Yoo, Seoung Bum Kim, Yonghyun Jeong

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We quantitatively evaluate our model against other baseline models and gauge the fidelity of samples and robustness to multiple input conditions, and demonstrate that our model substantially outperforms other models on various metrics."
Researcher Affiliation | Collaboration | Jonghyun Lee (1,2), Hansam Cho (1,2), Youngjoon Yoo (2), Seoung Bum Kim (1), Yonghyun Jeong (2); 1: Korea University, 2: NAVER Cloud
Pseudocode | Yes | "Algorithm 1 Soft guidance for a single training timestep t" (a hedged sketch of this step follows the table)
Open Source Code | Yes | "The source code and pretrained models can be found at https://github.com/tomtom1103/compose-and-conquer."
Open Datasets | Yes | "Our synthetic image triplets (I_f, I_b, M) are generated from two distinct datasets: COCO-Stuff (Caesar et al., 2018) and Pick-a-Pic (Kirstain et al., 2023)."
Dataset Splits | Yes | "We train our local fuser with the cloned E and C for 28 epochs, our global fuser for 24 epochs, and finetune the full model for 9 epochs, all with a batch size of 32 across 8 NVIDIA V100s." and "Table 1 reports the results evaluated on 5K images of the COCO-Stuff validation set." (collected into the config sketch below)
Hardware Specification | Yes | "We train our local fuser with the cloned E and C for 28 epochs, our global fuser for 24 epochs, and finetune the full model for 9 epochs, all with a batch size of 32 across 8 NVIDIA V100s."
Software Dependencies | No | The paper mentions software components such as the CLIP image encoder and Stable Diffusion variants, but does not provide version numbers for any software dependencies, such as PyTorch or other libraries.
Experiment Setup | Yes | "During training, images are resized and center cropped to a resolution of 512 × 512. We train our local fuser with the cloned E and C for 28 epochs, our global fuser for 24 epochs, and finetune the full model for 9 epochs, all with a batch size of 32 across 8 NVIDIA V100s. During training, we set an independent dropout probability for each condition to ensure that our model learns to generalize various combinations. For our evaluation, we employ DDIM (Song et al., 2020) sampling with 50 steps, and a CFG (Ho & Salimans, 2021) scale of 7 to generate images of 768 × 768." (see the preprocessing and sampling sketch below)
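The Pseudocode row refers to Algorithm 1, soft guidance for a single training timestep t. The core idea in the paper is to localize the global fuser's image embeddings by masking cross-attention with the foreground mask M. The snippet below is a minimal sketch under that assumption; soft_guided_cross_attention, its argument shapes, and the mask layout are names chosen for illustration, not the authors' code.

```python
import torch

def soft_guided_cross_attention(q, k, v, attn_mask, d_head):
    """Cross-attention with a spatial mask (a sketch of the 'soft guidance' idea).

    q:         (B, heads, HW, d_head)  queries from the U-Net feature map
    k, v:      (B, heads, T, d_head)   keys/values from the global (fg/bg) embeddings
    attn_mask: (B, 1, HW, T)           1 where a spatial location may attend to a token,
                                       e.g. M (resized and flattened) for foreground
                                       tokens and 1 - M for background tokens
    """
    scores = torch.einsum("bhqd,bhkd->bhqk", q, k) / d_head ** 0.5
    # Disallowed location-token pairs get a large negative score before softmax.
    scores = scores.masked_fill(attn_mask == 0, torch.finfo(scores.dtype).min)
    attn = scores.softmax(dim=-1)
    return torch.einsum("bhqk,bhkd->bhqd", attn, v)
```

In a full training step this masking would sit inside the denoiser call for timestep t, with the usual epsilon-prediction MSE loss on top; consult Algorithm 1 in the paper for the exact placement and mask handling.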
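The training details quoted under Dataset Splits and Hardware Specification can be collected in one place. The dictionary below only restates those reported values; the key names are chosen for this summary and do not come from the released code.

```python
# Reported training schedule and hardware, restated as a plain config.
train_config = {
    "local_fuser_epochs": 28,        # trained with the cloned encoder and blocks
    "global_fuser_epochs": 24,
    "full_model_finetune_epochs": 9,
    "batch_size": 32,
    "gpus": "8x NVIDIA V100",
    "train_resolution": (512, 512),  # resize + center crop
    "eval_set": "COCO-Stuff validation, 5K images",
}
```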
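The Experiment Setup row specifies 512 × 512 resize-and-center-crop preprocessing, an independent dropout probability per condition during training, and DDIM sampling with 50 steps at a CFG scale of 7 for 768 × 768 outputs. Below is a minimal sketch of those settings, assuming torchvision transforms for preprocessing and a diffusers-style DDIMScheduler with a Stable Diffusion v1.5 U-Net for sampling; the model identifier, dropout probability, and function names are placeholders rather than the paper's released pipeline.

```python
import random
import torch
from torchvision import transforms
from diffusers import DDIMScheduler

# Training-time preprocessing: resize, then center crop to 512 x 512.
preprocess = transforms.Compose([
    transforms.Resize(512),
    transforms.CenterCrop(512),
    transforms.ToTensor(),
])

# Independent dropout per condition so the model learns to handle any subset
# of conditions. The probability is a placeholder; the excerpt above does not
# state the exact value.
def drop_conditions(conds, p_drop=0.1):
    return {name: (None if random.random() < p_drop else cond)
            for name, cond in conds.items()}

# Inference: DDIM with 50 steps and classifier-free guidance at scale 7.
scheduler = DDIMScheduler.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="scheduler")
scheduler.set_timesteps(50)
guidance_scale = 7.0

@torch.no_grad()
def sample(unet, cond_emb, uncond_emb, latents):
    # latents: (B, 4, 96, 96) for 768 x 768 outputs with an SD-style VAE (factor 8).
    for t in scheduler.timesteps:
        eps_uncond = unet(latents, t, encoder_hidden_states=uncond_emb).sample
        eps_cond = unet(latents, t, encoder_hidden_states=cond_emb).sample
        eps = eps_uncond + guidance_scale * (eps_cond - eps_uncond)
        latents = scheduler.step(eps, t, latents).prev_sample
    return latents
```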