Generating Compositional Scenes via Text-to-Image RGBA Instance Generation

Authors: Alessandro Fontanella, Petru-Daniel Tudosiu, Yongxin Yang, Shifeng Zhang, Sarah Parisot

NeurIPS 2024

Reproducibility assessment (variable, result, and LLM response):
Research Type: Experimental. "Our experiments show that our RGBA diffusion model is capable of generating diverse and high-quality instances with precise control over object attributes. Through multi-layer composition, we demonstrate that our approach allows us to build and manipulate images from highly complex prompts with fine-grained control over object appearance and location, granting a higher degree of control than competing methods." (Section 4, Experiments)
Researcher Affiliation: Collaboration. Alessandro Fontanella (University of Edinburgh), Petru-Daniel Tudosiu (Huawei Noah's Ark Lab), Yongxin Yang (Huawei Noah's Ark Lab), Shifeng Zhang (Huawei Noah's Ark Lab), Sarah Parisot (Microsoft Research)
Pseudocode: Yes. "An algorithmic overview is provided in Appendix B." (Algorithm 1: Multi-layer scene composition)
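The full Algorithm 1 is only available in the paper's Appendix B. As an illustration of the core operation such a multi-layer scheme relies on, here is a minimal back-to-front alpha-compositing sketch; the function name and the straight-alpha convention are assumptions, not taken from the paper:

```python
import numpy as np

def composite_layers(background, layers):
    """Composite RGBA instance layers onto an RGB background, back to front.

    background: (H, W, 3) float array in [0, 1]
    layers: list of (H, W, 4) float arrays in [0, 1], straight (non-premultiplied) alpha
    """
    canvas = background.astype(np.float64).copy()
    for layer in layers:  # earlier entries are painted first, i.e. end up behind
        rgb, alpha = layer[..., :3], layer[..., 3:4]
        # Standard "over" operator: new colour weighted by its alpha,
        # existing canvas weighted by the remaining transparency.
        canvas = alpha * rgb + (1.0 - alpha) * canvas
    return canvas
```

Each generated RGBA instance would be pasted onto the scene this way, so later layers occlude earlier ones exactly where their alpha is high.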
Open Source Code: No. "Open sourcing of our code will depend on internal approval."
Open Datasets: Yes. "We employed 87,989 instances from the MuLAn dataset [47], a novel dataset consisting of automatically generated RGBA decompositions with a diverse array of scenes, styles and object categories, and 15,791 instances extracted from a variety of image matting datasets with high-quality masks." The MuLAn [47] dataset is available at the following URL: https://huggingface.co/datasets/mulan-dataset/v1.0.
Dataset Splits: No. "In order to find the best parameters for the classifier-free Guidance Scale (GS) and Guidance Rescaling (GR), we perform a grid search on validation data, measuring KID to evaluate image quality." However, the paper does not specify the splitting methodology or the size of this validation set.
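The quoted grid search over guidance scale and guidance rescaling could be sketched as below; `sample_fn` and `kid_fn` are hypothetical placeholders standing in for the paper's sampler and KID evaluator, which are not specified here:

```python
from itertools import product

def grid_search_guidance(sample_fn, kid_fn, scales, rescales):
    """Return (best KID, guidance scale, guidance rescale) over a parameter grid.

    sample_fn(gs, gr) -> images generated with those guidance settings
    kid_fn(images)    -> scalar KID score (lower is better)
    """
    best = None
    for gs, gr in product(scales, rescales):
        score = kid_fn(sample_fn(gs, gr))
        if best is None or score < best[0]:
            best = (score, gs, gr)
    return best
```

With a real sampler and KID implementation plugged in, this is the loop that would select the reported settings (guidance scale 2.5, rescaling 0.25) on validation data.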
Hardware Specification: No. The paper mentions training "in mixed precision with a batch size of 12", which implies GPU usage, but does not specify a GPU model, CPU, or other hardware details.
Software Dependencies: No. The paper mentions optimizers (Adam, AdamW), models (PixArt-α, Flan-T5-XXL, Phi-3), and samplers (DPM-Solver, PNDM), but does not provide version numbers for core software dependencies such as Python, PyTorch, or CUDA.
Experiment Setup: Yes. "Images were resized with bilinear interpolation then centre cropped to obtain a 512×512 image. ... The VAE was trained for a total of 23 epochs (with each epoch comprising 42,908 steps performed with a batch size of 2) employing the Adam optimiser with a starting learning rate of 4.5e-6, β1 = 0.5 and β2 = 0.9. The discriminator adversarial loss was introduced after 50k steps with a weight of 0.5. ... For the LDM fine-tuning ... 200k iterations and a second stage employing 50% of the data ... for 80k steps. ... We employed the AdamW optimiser [28] with a starting learning rate of 1e-5 and a cosine scheduler, and trained in mixed precision with a batch size of 12. ... guidance scale 2.5 and rescaling parameter 0.25."
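The preprocessing step quoted above (bilinear resize, then centre crop to 512×512) might look like the following Pillow sketch; the function name and the shorter-side scaling rule are assumptions, as the paper does not spell out the resize target:

```python
from PIL import Image

def preprocess(img, size=512):
    """Resize so the shorter side equals `size` (bilinear), then centre crop to size x size."""
    w, h = img.size
    scale = size / min(w, h)
    img = img.resize((round(w * scale), round(h * scale)), Image.BILINEAR)
    left = (img.width - size) // 2
    top = (img.height - size) // 2
    return img.crop((left, top, left + size, top + size))
```

Applied to an arbitrary-resolution training image, this yields the 512×512 crops the VAE and LDM stages are described as training on.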