Learning to Predict Layout-to-image Conditional Convolutions for Semantic Image Synthesis

Authors: Xihui Liu, Guojun Yin, Jing Shao, Xiaogang Wang, Hongsheng Li

NeurIPS 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental We achieve state-of-the-art results on both quantitative metrics and subjective evaluation on various semantic segmentation datasets, demonstrating the effectiveness of our approach.

1 Introduction
Recently, generative adversarial networks (GANs) [6] have shown stunning results in generating photorealistic images of faces [16, 17] and simple objects [34, 1, 22]. However, generating photorealistic images of complex scenes with different types of objects and stuff remains a challenging problem. We consider semantic image synthesis, which aims at generating photorealistic images conditioned on semantic layouts. It has wide applications in controllable image synthesis and interactive image manipulation. State-of-the-art methods are mostly based on GANs. Code is available at https://github.com/xh-liu/CC-FPSE.

4 Experiments
4.1 Datasets and Evaluation Metrics
We experiment on the Cityscapes [5], COCO-Stuff [2], and ADE20K [36] datasets. The Cityscapes dataset has 3,000 training images and 500 validation images of urban street scenes. COCO-Stuff is the most challenging dataset, containing 118,000 training images and 5,000 validation images of complex scenes. The ADE20K dataset provides 20,000 training images and 2,000 validation images of both outdoor and indoor scenes. All images are annotated with semantic segmentation masks.
We evaluate our approach from three aspects. First, we compare images synthesized by our approach and by previous approaches, and conduct a human perceptual evaluation of the visual quality of the generated images. Second, we evaluate the segmentation performance of the generated images using a segmentation model pretrained on the original datasets; we use the same segmentation models as [25] for testing, and measure performance by mean Intersection-over-Union (mIoU) and pixel accuracy. Finally, we measure the distribution distance between generated and real images with the Fréchet Inception Distance (FID) [10].

4.3 Qualitative Results and Human Perceptual Evaluation
We compare our results with the previous approaches pix2pixHD [29] and SPADE [25], as shown in Figure 3. The images generated by our approach show significant improvement over previous approaches on challenging scenes: they have finer details such as edges and textures, fewer artifacts, and match the input semantic layout better. Figure 4 shows more images generated by our approach. More results and comparisons are provided in the supplementary material.
We also conduct a human perceptual evaluation to compare the quality of images generated by our method and by the previous state-of-the-art method, SPADE [25]. We randomly sample 500 semantic label maps from the validation set of each dataset. In each trial, a worker is shown a semantic label map together with the two images generated by our approach and by SPADE, and is asked to choose the image with higher quality that better matches the semantic layout. On the Cityscapes, COCO-Stuff, and ADE20K datasets, 55%, 76%, and 61% of the images generated by our method, respectively, are preferred over SPADE. The human perceptual evaluation validates that our approach generates higher-fidelity images that are better spatially aligned with the semantic layout.
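As a concrete reference for the segmentation-based metrics described in Section 4.1 above, the following is a minimal sketch (not the authors' released evaluation code) of how mIoU and pixel accuracy can be computed once a pretrained segmentation model has labeled a generated image and the prediction is scored against the input semantic layout; the function names are illustrative assumptions.

```python
import numpy as np

def confusion_matrix(pred, gt, num_classes):
    """Accumulate a num_classes x num_classes confusion matrix from two label maps."""
    mask = (gt >= 0) & (gt < num_classes)          # ignore invalid / void labels
    idx = num_classes * gt[mask].astype(int) + pred[mask].astype(int)
    return np.bincount(idx, minlength=num_classes ** 2).reshape(num_classes, num_classes)

def miou_and_pixel_accuracy(conf):
    """Compute mIoU and pixel accuracy from an accumulated confusion matrix."""
    tp = np.diag(conf)
    union = conf.sum(axis=0) + conf.sum(axis=1) - tp
    iou = tp / np.maximum(union, 1)                # per-class IoU, guarding empty classes
    miou = iou[union > 0].mean()                   # average only over classes that appear
    pixel_acc = tp.sum() / conf.sum()
    return miou, pixel_acc
```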
4.4 Quantitative Results
Table 1 shows the segmentation performance and FID scores of our approach and of previous approaches. CRN [3] uses cascaded refinement networks trained with a regression loss, without GAN training. SIMS is a semi-parametric approach that retrieves reference segments from a memory bank and refines the canvas with a refinement network. Both pix2pixHD [29] and SPADE [25] are GAN-based approaches. Pix2pixHD takes the semantic label map as the generator input and uses a multi-scale generator and multi-scale discriminator to generate high-resolution images. SPADE takes a noise vector as input, and the semantic label map is used to modulate the activations in the normalization layers through learned affine transformations. Our approach performs consistently better than previous approaches, which demonstrates the effectiveness of the proposed approach.

4.5 Ablation Studies
We conduct controlled experiments to verify the effectiveness of each component of our approach. We use the SPADE [25] model as our baseline, and gradually add or remove each component of the framework. Our full model is denoted as CC-FPSE in the last column. The segmentation mIoU scores of the generated images for each experiment are shown in Table 2.
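The FID scores in Table 1 compare Inception feature statistics of generated and real images. Below is a minimal sketch of the FID computation, assuming Inception-v3 features have already been extracted for both image sets; the feature-extraction step is omitted and the function name is an illustrative assumption, not the paper's implementation.

```python
import numpy as np
from scipy import linalg

def fid(real_feats, fake_feats):
    """FID between two sets of Inception features, each of shape (N, 2048)."""
    mu1, mu2 = real_feats.mean(axis=0), fake_feats.mean(axis=0)
    sigma1 = np.cov(real_feats, rowvar=False)
    sigma2 = np.cov(fake_feats, rowvar=False)
    covmean, _ = linalg.sqrtm(sigma1 @ sigma2, disp=False)
    if np.iscomplexobj(covmean):                   # numerical noise can leave tiny imaginary parts
        covmean = covmean.real
    diff = mu1 - mu2
    return diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean)
```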
Researcher Affiliation Collaboration Xihui Liu (The Chinese University of Hong Kong, xihuiliu@ee.cuhk.edu.hk); Guojun Yin (University of Science and Technology of China, gjyin91@gmail.com); Jing Shao (SenseTime Research, shaojing@sensetime.com); Xiaogang Wang (The Chinese University of Hong Kong, xgwang@ee.cuhk.edu.hk); Hongsheng Li (The Chinese University of Hong Kong, hsli@ee.cuhk.edu.hk)
Pseudocode No No pseudocode or algorithm blocks were found in the paper.
Open Source Code Yes Code is available at https://github.com/xh-liu/CC-FPSE
Open Datasets Yes We experiment on Cityscapes [5], COCO-Stuff [2], and ADE20K [36] datasets. The Cityscapes dataset has 3,000 training images and 500 validation images of urban street scenes. COCO-Stuff is the most challenging dataset, containing 118,000 training images and 5,000 validation images from complex scenes. ADE20K dataset provides 20,000 training images and 2,000 validation images from both outdoor and indoor scenes. All images are annotated with semantic segmentation masks.
Dataset Splits Yes The Cityscapes dataset has 3,000 training images and 500 validation images of urban street scenes. COCO-Stuff is the most challenging dataset, containing 118,000 training images and 5,000 validation images from complex scenes. ADE20K dataset provides 20,000 training images and 2,000 validation images from both outdoor and indoor scenes.
Hardware Specification Yes Our models are trained on 16 TITANX GPUs, with a batch size of 32.
Software Dependencies No The paper mentions software components like 'batch normalization', 'Leaky ReLU', 'ADAM optimizer', and 'VGG extracted features', but does not provide specific version numbers for any of these or the underlying frameworks (e.g., PyTorch, TensorFlow, CUDA).
Experiment Setup Yes The training and generated image resolution is 256 x 256 for the COCO-Stuff and ADE20K datasets, and 256 x 512 for the Cityscapes dataset. For the generator, synchronized batch normalization across GPUs is adopted to better estimate the batch statistics. For the discriminator, we utilize instance normalization. We use Leaky ReLU activations to avoid the sparse gradients caused by ReLU. We adopt the ADAM [18] optimizer with learning rate 0.0001 for the generator and 0.0004 for the discriminator. The weight for the perceptual loss λ_P is 10 and the weight for the discriminator feature matching loss λ_FM is 20. Following [25], to enable multi-modal synthesis and style-guided synthesis, we apply a style encoder and a KL-divergence loss with loss weight 0.05. Our models are trained on 16 TITANX GPUs, with a batch size of 32. We train for 200 epochs on the Cityscapes and ADE20K datasets, and 100 epochs on the COCO-Stuff dataset.
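A minimal sketch of how the optimizer and loss-weight settings quoted above could be wired up in PyTorch; the placeholder networks, variable names, and the unweighted adversarial term are assumptions for illustration and do not reproduce the released CC-FPSE code.

```python
import torch
from torch import nn

# Placeholder networks so the snippet runs; the paper's actual generator and
# discriminator architectures are not reproduced here.
generator = nn.Sequential(nn.Conv2d(3, 3, 3, padding=1))
discriminator = nn.Sequential(nn.Conv2d(3, 1, 3, padding=1))

# Learning rates from the setup above (ADAM, 1e-4 for G and 4e-4 for D);
# the momentum terms (betas) are not given in this excerpt, so defaults are kept.
opt_g = torch.optim.Adam(generator.parameters(), lr=1e-4)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=4e-4)

# Loss weights quoted above: lambda_P = 10, lambda_FM = 20, KL weight = 0.05.
loss_weights = {"perceptual": 10.0, "feature_matching": 20.0, "kl": 0.05}

def generator_objective(losses):
    """Weighted sum of the generator's loss terms (hypothetical dictionary keys)."""
    return losses["adversarial"] + sum(w * losses[k] for k, w in loss_weights.items())
```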