Composer: Creative and Controllable Image Synthesis with Composable Conditions
Authors: Lianghua Huang, Di Chen, Yu Liu, Yujun Shen, Deli Zhao, Jingren Zhou
ICML 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Despite being trained in a multitask manner, Composer achieves a zero-shot FID of 9.2 in text-to-image synthesis on the COCO dataset (Lin et al., 2014) when using only the caption as the condition, indicating its ability to produce high-quality results. We train a 2B parameter base model for conditional image generation at 64×64 resolution, a 1.1B parameter model for upscaling images to 256×256 resolution, and a 300M parameter model for further upscaling images to 1024×1024 resolution. We conduct user studies to evaluate the performance of four pretrained models trained using different settings on five generation tasks. |
| Researcher Affiliation | Industry | 1Alibaba Group 2Ant Group. Correspondence to: Lianghua Huang, Di Chen, Yu Liu <xuangen.hlh, guangpan.cd, ly103369@alibaba-inc.com>, Yujun Shen, Deli Zhao <shenyujun0302, zhaodeli@gmail.com>, Jingren Zhou <jingren.zhou@alibaba-inc.com>. |
| Pseudocode | No | The paper describes the system architecture and processes but does not include any formal pseudocode or algorithm blocks. |
| Open Source Code | No | Code and models will be made available. |
| Open Datasets | Yes | We train on a combination of public datasets, including ImageNet21K (Russakovsky et al., 2014), WebVision (Li et al., 2017), and a filtered version of the LAION dataset (Schuhmann et al., 2022) with around 1B images. |
| Dataset Splits | No | The paper mentions training on a combination of public datasets and fine-tuning on a subset of LAION, but it does not specify explicit training/validation/test splits (e.g., percentages or sample counts) for reproducibility. |
| Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU models, CPU types) used for running the experiments. |
| Software Dependencies | No | The paper mentions leveraging architectures and models like GLIDE, CLIP, and YOLOv5, and using DPM-Solver++, but it does not specify explicit version numbers for software dependencies or libraries required for reproduction. |
| Experiment Setup | Yes | We train a 2B parameter base model for conditional image generation at 64×64 resolution, a 1.1B parameter model for upscaling images to 256×256 resolution, and a 300M parameter model for further upscaling images to 1024×1024 resolution. Additionally, we trained a 1B parameter prior model for optionally projecting captions to image embeddings. We use batch sizes of 4096, 1024, 512, and 512 for the prior, base, and two upsampling models, respectively. We train on a combination of public datasets, including ImageNet21K (Russakovsky et al., 2014), WebVision (Li et al., 2017), and a filtered version of the LAION dataset (Schuhmann et al., 2022) with around 1B images. For the base model, we pretrain it with 1M steps on the full dataset using only image embeddings as the condition, and then finetune the model on a subset of 60M examples (excluding LAION images with aesthetic scores below 7.0) from the original dataset for 200K steps with all conditions enabled. The prior and upsampling models are trained for 1M steps on the full dataset. (Excerpt from Section 3.1 and Table 1 content) |
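
For quick reference, the training setup quoted in the "Experiment Setup" row can be collected into a single configuration sketch. This is a minimal, hypothetical summary of the numbers reported in the paper; the dictionary names and structure are illustrative only and do not come from an official codebase (none has been released).

```python
# Hypothetical summary of the Composer training configuration reported in Section 3.1.
# All numbers are taken from the paper's text; key names are illustrative, not official.

TRAINING_CONFIG = {
    "prior": {                       # optionally projects captions to image embeddings
        "parameters": 1_000_000_000,
        "batch_size": 4096,
        "train_steps": 1_000_000,
    },
    "base": {                        # conditional generation at 64x64
        "parameters": 2_000_000_000,
        "resolution": 64,
        "batch_size": 1024,
        "pretrain_steps": 1_000_000,  # image-embedding condition only, full ~1B-image dataset
        "finetune_steps": 200_000,    # all conditions, 60M subset (LAION aesthetic >= 7.0)
    },
    "upsampler_256": {               # upscales 64x64 -> 256x256
        "parameters": 1_100_000_000,
        "resolution": 256,
        "batch_size": 512,
        "train_steps": 1_000_000,
    },
    "upsampler_1024": {              # upscales 256x256 -> 1024x1024
        "parameters": 300_000_000,
        "resolution": 1024,
        "batch_size": 512,
        "train_steps": 1_000_000,
    },
}

if __name__ == "__main__":
    # Print the reported setup per model for a quick overview.
    for model_name, cfg in TRAINING_CONFIG.items():
        print(f"{model_name}: {cfg}")
```

Note that the paper does not report hardware, software versions, or dataset splits, so a full reproduction would still require choices beyond what this sketch captures.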