Compose and Conquer: Diffusion-Based 3D Depth Aware Composable Image Synthesis

Authors: Jonghyun Lee, Hansam Cho, Youngjoon Yoo, Seoung Bum Kim, Yonghyun Jeong

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We quantitatively evaluate our model against other baseline models and gauge the fidelity of samples and robustness to multiple input conditions, and demonstrate that our model substantially outperforms other models on various metrics."
Researcher Affiliation | Collaboration | Jonghyun Lee (1,2), Hansam Cho (1,2), Youngjoon Yoo (2), Seoung Bum Kim (1), Yonghyun Jeong (2); 1: Korea University, 2: NAVER Cloud
Pseudocode | Yes | "Algorithm 1 Soft guidance for a single training timestep t" (a hedged sketch of this step follows the table)
Open Source Code | Yes | "The source code and pretrained models can be found at https://github.com/tomtom1103/compose-and-conquer."
Open Datasets | Yes | "Our synthetic image triplets (I_f, I_b, M) are generated from two distinct datasets: COCO-Stuff (Caesar et al., 2018) and Pick-a-Pic (Kirstain et al., 2023)."
Dataset Splits | Yes | "We train our local fuser with the cloned E and C for 28 epochs, our global fuser for 24 epochs, and finetune the full model for 9 epochs, all with a batch size of 32 across 8 NVIDIA V100s." and "Table 1 reports the results evaluated on 5K images of the COCO-Stuff validation set." (collected into the config sketch below)
Hardware Specification | Yes | "We train our local fuser with the cloned E and C for 28 epochs, our global fuser for 24 epochs, and finetune the full model for 9 epochs, all with a batch size of 32 across 8 NVIDIA V100s."
Software Dependencies | No | The paper mentions software components such as the CLIP image encoder and Stable Diffusion variants, but does not provide version numbers for any software dependencies, such as PyTorch or other libraries.
Experiment Setup | Yes | "During training, images are resized and center cropped to a resolution of 512 × 512. We train our local fuser with the cloned E and C for 28 epochs, our global fuser for 24 epochs, and finetune the full model for 9 epochs, all with a batch size of 32 across 8 NVIDIA V100s. During training, we set an independent dropout probability for each condition to ensure that our model learns to generalize various combinations. For our evaluation, we employ DDIM (Song et al., 2020) sampling with 50 steps, and a CFG (Ho & Salimans, 2021) scale of 7 to generate images of 768 × 768." (see the preprocessing and sampling sketch below)
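The Pseudocode row refers to Algorithm 1, soft guidance for a single training timestep t. The core idea in the paper is to localize the global fuser's image embeddings by masking cross-attention with the foreground mask M. The snippet below is a minimal sketch under that assumption; soft_guided_cross_attention, its argument shapes, and the mask layout are names chosen for illustration, not the authors' code.

```python
import torch

def soft_guided_cross_attention(q, k, v, attn_mask, d_head):
    """Cross-attention with a spatial mask (a sketch of the 'soft guidance' idea).

    q:         (B, heads, HW, d_head)  queries from the U-Net feature map
    k, v:      (B, heads, T, d_head)   keys/values from the global (fg/bg) embeddings
    attn_mask: (B, 1, HW, T)           1 where a spatial location may attend to a token,
                                       e.g. M (resized and flattened) for foreground
                                       tokens and 1 - M for background tokens
    """
    scores = torch.einsum("bhqd,bhkd->bhqk", q, k) / d_head ** 0.5
    # Disallowed location-token pairs get a large negative score before softmax.
    scores = scores.masked_fill(attn_mask == 0, torch.finfo(scores.dtype).min)
    attn = scores.softmax(dim=-1)
    return torch.einsum("bhqk,bhkd->bhqd", attn, v)
```

In a full training step this masking would sit inside the denoiser call for timestep t, with the usual epsilon-prediction MSE loss on top; consult Algorithm 1 in the paper for the exact placement and mask handling.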
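The training details quoted under Dataset Splits and Hardware Specification can be collected in one place. The dictionary below only restates those reported values; the key names are chosen for this summary and do not come from the released code.

```python
# Reported training schedule and hardware, restated as a plain config.
train_config = {
    "local_fuser_epochs": 28,        # trained with the cloned encoder and blocks
    "global_fuser_epochs": 24,
    "full_model_finetune_epochs": 9,
    "batch_size": 32,
    "gpus": "8x NVIDIA V100",
    "train_resolution": (512, 512),  # resize + center crop
    "eval_set": "COCO-Stuff validation, 5K images",
}
```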
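The Experiment Setup row specifies 512 × 512 resize-and-center-crop preprocessing, an independent dropout probability per condition during training, and DDIM sampling with 50 steps at a CFG scale of 7 for 768 × 768 outputs. Below is a minimal sketch of those settings, assuming torchvision transforms for preprocessing and a diffusers-style DDIMScheduler with a Stable Diffusion v1.5 U-Net for sampling; the model identifier, dropout probability, and function names are placeholders rather than the paper's released pipeline.

```python
import random
import torch
from torchvision import transforms
from diffusers import DDIMScheduler

# Training-time preprocessing: resize, then center crop to 512 x 512.
preprocess = transforms.Compose([
    transforms.Resize(512),
    transforms.CenterCrop(512),
    transforms.ToTensor(),
])

# Independent dropout per condition so the model learns to handle any subset
# of conditions. The probability is a placeholder; the excerpt above does not
# state the exact value.
def drop_conditions(conds, p_drop=0.1):
    return {name: (None if random.random() < p_drop else cond)
            for name, cond in conds.items()}

# Inference: DDIM with 50 steps and classifier-free guidance at scale 7.
scheduler = DDIMScheduler.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="scheduler")
scheduler.set_timesteps(50)
guidance_scale = 7.0

@torch.no_grad()
def sample(unet, cond_emb, uncond_emb, latents):
    # latents: (B, 4, 96, 96) for 768 x 768 outputs with an SD-style VAE (factor 8).
    for t in scheduler.timesteps:
        eps_uncond = unet(latents, t, encoder_hidden_states=uncond_emb).sample
        eps_cond = unet(latents, t, encoder_hidden_states=cond_emb).sample
        eps = eps_uncond + guidance_scale * (eps_cond - eps_uncond)
        latents = scheduler.step(eps, t, latents).prev_sample
    return latents
```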