Composer: Creative and Controllable Image Synthesis with Composable Conditions

Authors: Lianghua Huang, Di Chen, Yu Liu, Yujun Shen, Deli Zhao, Jingren Zhou

ICML 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Despite being trained in a multitask manner, Composer achieves a zero-shot FID of 9.2 in text-to-image synthesis on the COCO dataset (Lin et al., 2014) when using only caption as the condition, indicating its ability to produce high-quality results. We train a 2B parameter base model for conditional image generation at 64×64 resolution, a 1.1B parameter model for upscaling images to 256×256 resolution, and a 300M parameter model for further upscaling images to 1024×1024 resolution. We conduct user studies to evaluate the performance of four pretrained models trained using different settings on five generation tasks.
Researcher Affiliation | Industry | 1Alibaba Group, 2Ant Group. Correspondence to: Lianghua Huang, Di Chen, Yu Liu <xuangen.hlh, guangpan.cd, ly103369@alibaba-inc.com>, Yujun Shen, Deli Zhao <shenyujun0302, zhaodeli@gmail.com>, Jingren Zhou <jingren.zhou@alibaba-inc.com>.
Pseudocode | No | The paper describes the system architecture and processes but does not include any formal pseudocode or algorithm blocks.
Open Source Code | No | Code and models will be made available.
Open Datasets | Yes | We train on a combination of public datasets, including ImageNet21K (Russakovsky et al., 2014), WebVision (Li et al., 2017), and a filtered version of the LAION dataset (Schuhmann et al., 2022) with around 1B images.
Dataset Splits | No | The paper mentions training on a combination of public datasets and fine-tuning on a subset of LAION, but it does not specify explicit training/validation/test splits (e.g., percentages or sample counts) for reproducibility.
Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU models, CPU types) used for running the experiments.
Software Dependencies | No | The paper mentions leveraging architectures and models like GLIDE, CLIP, and YOLOv5, and using DPM-Solver++, but it does not specify explicit version numbers for software dependencies or libraries required for reproduction.
Experiment Setup | Yes | We train a 2B parameter base model for conditional image generation at 64×64 resolution, a 1.1B parameter model for upscaling images to 256×256 resolution, and a 300M parameter model for further upscaling images to 1024×1024 resolution. Additionally, we trained a 1B parameter prior model for optionally projecting captions to image embeddings. We use batch sizes of 4096, 1024, 512, and 512 for the prior, base, and two upsampling models, respectively. We train on a combination of public datasets, including ImageNet21K (Russakovsky et al., 2014), WebVision (Li et al., 2017), and a filtered version of the LAION dataset (Schuhmann et al., 2022) with around 1B images. For the base model, we pretrain it with 1M steps on the full dataset using only image embeddings as the condition, and then finetune the model on a subset of 60M examples (excluding LAION images with aesthetic scores below 7.0) from the original dataset for 200K steps with all conditions enabled. The prior and upsampling models are trained for 1M steps on the full dataset. (Excerpt from Section 3.1 and Table 1.)
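The quoted setup describes a cascade of four models, each with its own parameter count, resolution, batch size, and training schedule. The sketch below collects those numbers into a single configuration for quick reference; it is a minimal, hypothetical summary in which all field and variable names are ours, not the authors' training code.

```python
# Hypothetical configuration summarizing the training setup quoted in the
# "Experiment Setup" row above. Field names are illustrative only.

COMPOSER_TRAINING_CONFIG = {
    # 1B-parameter prior: optionally projects captions to image embeddings.
    "prior": {"params": "1B", "batch_size": 4096, "steps": 1_000_000},
    # 2B-parameter base model: conditional generation at 64x64.
    "base": {
        "params": "2B",
        "resolution": 64,
        "batch_size": 1024,
        # Pretrained for 1M steps on the full dataset with only the image
        # embedding as the condition.
        "pretrain": {"steps": 1_000_000, "conditions": ["image_embedding"]},
        # Finetuned for 200K steps on ~60M examples (LAION images with
        # aesthetic scores below 7.0 excluded) with all conditions enabled.
        "finetune": {"steps": 200_000, "examples": 60_000_000, "conditions": "all"},
    },
    # First upsampler: 64x64 -> 256x256.
    "upsampler_256": {"params": "1.1B", "resolution": 256, "batch_size": 512, "steps": 1_000_000},
    # Second upsampler: 256x256 -> 1024x1024.
    "upsampler_1024": {"params": "300M", "resolution": 1024, "batch_size": 512, "steps": 1_000_000},
}

if __name__ == "__main__":
    # Print the cascade stages in order for a quick sanity check.
    for stage, cfg in COMPOSER_TRAINING_CONFIG.items():
        print(f"{stage}: {cfg}")
```

Note that the paper itself reports these numbers in prose and in Table 1; hardware, wall-clock time, and optimizer settings are not part of the quoted excerpt and are therefore omitted here.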