Generating compositional scenes via Text-to-image RGBA Instance Generation
Authors: Alessandro Fontanella, Petru-Daniel Tudosiu, Yongxin Yang, Shifeng Zhang, Sarah Parisot
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments show that our RGBA diffusion model is capable of generating diverse and high-quality instances with precise control over object attributes. Through multi-layer composition, we demonstrate that our approach allows us to build and manipulate images from highly complex prompts with fine-grained control over object appearance and location, granting a higher degree of control than competing methods. ... 4 Experiments |
| Researcher Affiliation | Collaboration | Alessandro Fontanella, University of Edinburgh; Petru-Daniel Tudosiu, Huawei Noah's Ark Lab; Yongxin Yang, Huawei Noah's Ark Lab; Shifeng Zhang, Huawei Noah's Ark Lab; Sarah Parisot, Microsoft Research |
| Pseudocode | Yes | An algorithmic overview is provided in Appendix B. ... Algorithm 1: Multi-layer scene composition |
| Open Source Code | No | Open sourcing of our code will depend on internal approval. |
| Open Datasets | Yes | We employed 87989 instances from the MuLAn dataset [47], a novel dataset consisting of automatically generated RGBA decompositions with a diverse array of scenes, styles and object categories, and 15791 instances extracted from a variety of image matting datasets with high-quality masks. ... The MuLAn [47] dataset... The dataset used is available at the following url: https://huggingface.co/datasets/mulan-dataset/v1.0. |
| Dataset Splits | No | In order to find the best parameters for the classifier-free Guidance Scale (GS) and Guidance Rescaling (GR), we perform a grid search on validation data, measuring KID to evaluate image quality. (However, the paper does not specify the splitting methodology or size of this validation data.) |
| Hardware Specification | No | The paper mentions "trained in mixed precision with a batch size of 12" which implies GPU usage, but does not specify any particular GPU model, CPU, or detailed computer specifications. |
| Software Dependencies | No | The paper mentions optimizers (Adam, AdamW), models (PixArt-α, Flan-T5-XXL, Phi-3), and samplers (DPM-Solver, PNDM) but does not provide specific version numbers for core software dependencies like Python, PyTorch, or CUDA. |
| Experiment Setup | Yes | Images were resized with bilinear interpolation then centre cropped to obtain a 512×512 image. ... The VAE was trained for a total of 23 epochs (with each epoch comprising 42908 steps performed with a batch size of 2) employing the Adam optimiser with starting learning rate of 4.5e-6, β1 = 0.5 and β2 = 0.9. The discriminator adversarial loss was introduced after 50k steps with a weight of 0.5. ... For the LDM fine-tuning... 200k iterations and a second stage employing 50% of the data... for 80k steps. ... We employed the AdamW optimiser [28] with a starting learning rate of 1e-5 and cosine scheduler, and trained in mixed precision with a batch size of 12. ... guidance scale 2.5 and rescaling parameter 0.25. |
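The paper's Algorithm 1 ("Multi-layer scene composition") assembles a scene by stacking generated RGBA instances over a background. The exact procedure is in the paper's Appendix B; the core compositing step it relies on can be sketched with standard "over" alpha blending. The function name and array layout below are illustrative, not taken from the paper's code:

```python
import numpy as np

def composite_layers(background, layers):
    """Sequentially alpha-composite RGBA instance layers over an RGB background.

    background: uint8 array of shape (H, W, 3)
    layers: iterable of uint8 arrays of shape (H, W, 4), back-to-front order
    """
    canvas = background.astype(np.float64)
    for layer in layers:
        rgb = layer[..., :3].astype(np.float64)
        # alpha channel normalised to [0, 1]; broadcast over RGB channels
        alpha = layer[..., 3:4].astype(np.float64) / 255.0
        # standard "over" operator: layer in front of current canvas
        canvas = alpha * rgb + (1.0 - alpha) * canvas
    return np.clip(canvas, 0, 255).astype(np.uint8)
```

Because each instance carries its own alpha channel, layers can be repositioned or swapped and the scene re-composited without regenerating the other objects, which is what gives the method its fine-grained control over object location.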
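The reported sampling settings (guidance scale 2.5, rescaling parameter 0.25) combine classifier-free guidance with guidance rescaling, which shrinks the guided prediction back toward the standard deviation of the conditional one to avoid over-saturation. A minimal sketch of that combination, with illustrative function and argument names (the paper does not provide this code):

```python
import numpy as np

def guided_noise(noise_uncond, noise_cond, guidance_scale=2.5, guidance_rescale=0.25):
    """Classifier-free guidance with rescaling applied to noise predictions."""
    # standard classifier-free guidance extrapolation
    noise_cfg = noise_uncond + guidance_scale * (noise_cond - noise_uncond)
    # rescale the guided prediction toward the conditional prediction's std
    std_cond = noise_cond.std()
    std_cfg = noise_cfg.std()
    noise_rescaled = noise_cfg * (std_cond / std_cfg)
    # blend rescaled and plain guided predictions by the rescaling parameter
    return guidance_rescale * noise_rescaled + (1.0 - guidance_rescale) * noise_cfg
```

With `guidance_rescale=0` this reduces to plain classifier-free guidance; the paper tunes both parameters by grid search on validation data, measuring KID.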