PLay: Parametrically Conditioned Layout Generation using Latent Diffusion

Authors: Chin-Yi Cheng, Forrest Huang, Gang Li, Yang Li

ICML 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our method outperforms prior works across three datasets on metrics including FID and FD-VG, and in a user study.
Researcher Affiliation | Industry | Google Research, Mountain View, United States. Correspondence to: Chin-Yi Cheng <cchinyi@google.com>, Yang Li <liyang@google.com>.
Pseudocode | No | No explicit pseudocode or algorithm blocks are provided. The model components are described visually in Figure 14, and the processes are explained in the text.
Open Source Code | No | The paper contains no explicit statement that the authors are releasing code for the described methodology, nor does it provide a link to a code repository.
Open Datasets | Yes | We experiment PLay with three publicly available datasets for two different domains: UI and document layouts. CLAY (Li et al., 2022) contains about 50K UI layouts with 24 classes. RICO-Semantic (Liu et al., 2018) contains about 43K UI layouts with 13 classes previously used in VTN. PublayNet (Zhong et al., 2019) contains about 330K document layouts with 5 classes.
Dataset Splits | No | The paper mentions using the CLAY, RICO-Semantic, and PublayNet datasets but does not provide explicit percentages or counts for training, validation, or test splits. It only mentions 'sample size s = 1024' for metric computation, which is not a dataset split.
Hardware Specification | Yes | The model is trained using 8 Google Cloud TPU v4 cores for 47 hours.
Software Dependencies | No | We implemented the proposed architecture in JAX and Flax. The paper names these frameworks but does not specify version numbers for any of its software dependencies.
Experiment Setup | Yes | We use the ADAM optimizer (b1 = 0.9, b2 = 0.98) with 500k steps and a batch size of 128. The learning rate is 0.001 with linear warm-up over 8k steps. For the denoise network ϵ_θ(z_t, τ_ψ(G), t), we use a Transformer encoder to replace the U-Net structure used in image-based DMs and predict the noise ϵ. We also add a small KL-penalty to regularize the latent space while keeping high reconstruction accuracy. In sampling, we use DDPM and CFG with w = 1.5. We also found that discrete coordinate values work better empirically and set the dimensions of each layout to width = 36 and height = 64. We fix the maximum number of elements per layout at N = 128; layouts with fewer elements are padded to the same size, which results in fixed N and D for all layouts. We also fix the maximum number of guidelines per layout at M = 128.
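
To make the training configuration in the Experiment Setup row concrete, here is a minimal sketch of the stated optimizer and learning-rate schedule. It assumes the optax library on top of the paper's JAX/Flax stack; the paper names only JAX and Flax, so optax and the exact warm-up shape are assumptions.

```python
import optax

# Linear warm-up from 0 to the stated learning rate of 1e-3 over 8k steps;
# optax.linear_schedule holds the end value after transition_steps.
learning_rate = optax.linear_schedule(
    init_value=0.0, end_value=1e-3, transition_steps=8_000)

# ADAM with the betas quoted above; training runs for 500k steps
# at a batch size of 128.
optimizer = optax.adam(learning_rate, b1=0.9, b2=0.98)
```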
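The denoise network described above, a Transformer encoder standing in for the U-Net of image diffusion models, could look roughly like the Flax sketch below. The layer count, width, timestep embedding, and the way the guideline embedding τ_ψ(G) is injected (here, concatenated as extra tokens) are all assumptions, not the paper's specification.

```python
import flax.linen as nn
import jax.numpy as jnp

class TransformerDenoiser(nn.Module):
    dim: int = 256
    num_layers: int = 4
    num_heads: int = 8

    @nn.compact
    def __call__(self, z_t, cond, t_emb):
        # Project latents (B, N, D), guideline embeddings (B, M, Dc), and
        # the timestep embedding (B, T) to a shared width, then concatenate
        # along the token axis.
        z = nn.Dense(self.dim)(z_t)
        c = nn.Dense(self.dim)(cond)
        t = nn.Dense(self.dim)(t_emb)[:, None, :]
        x = jnp.concatenate([t, z, c], axis=1)
        for _ in range(self.num_layers):
            # Pre-norm self-attention block with a residual connection.
            h = nn.LayerNorm()(x)
            x = x + nn.SelfAttention(num_heads=self.num_heads)(h)
            # Pre-norm MLP block with a residual connection.
            h = nn.LayerNorm()(x)
            x = x + nn.Dense(self.dim)(nn.gelu(nn.Dense(4 * self.dim)(h)))
        # Predict the noise eps only for the latent tokens.
        return nn.Dense(z_t.shape[-1])(x[:, 1:1 + z_t.shape[1], :])
```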
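For the sampling step, here is a hedged sketch of one classifier-free guidance (CFG) noise estimate with w = 1.5. The paper does not spell out which CFG convention it uses; this uses the common form ϵ_uncond + w·(ϵ_cond − ϵ_uncond), and denoise_fn and null_cond are hypothetical names.

```python
def guided_noise(denoise_fn, z_t, cond, null_cond, t, w=1.5):
    """One CFG noise estimate: run the denoiser conditioned on the
    guidelines G, then on a learned null condition, and extrapolate."""
    eps_cond = denoise_fn(z_t, cond, t)
    eps_uncond = denoise_fn(z_t, null_cond, t)
    return eps_uncond + w * (eps_cond - eps_uncond)
```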
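Finally, the fixed-size representation (N = 128 elements, M = 128 guidelines, with shorter layouts padded) can be sketched as below; the feature dimensions, the zero pad value, and the function names are assumptions for illustration.

```python
import jax.numpy as jnp

N_MAX = 128  # maximum elements per layout (N)
M_MAX = 128  # maximum guidelines per layout (M)

def pad_to_fixed(tokens: jnp.ndarray, max_len: int) -> jnp.ndarray:
    """Zero-pad a (n, D) token array to (max_len, D)."""
    n = tokens.shape[0]
    return jnp.pad(tokens, ((0, max_len - n), (0, 0)))

# Example: a 5-element layout and 3 guidelines, each padded to fixed size.
elements = pad_to_fixed(jnp.zeros((5, 8)), N_MAX)    # -> (128, 8)
guidelines = pad_to_fixed(jnp.zeros((3, 4)), M_MAX)  # -> (128, 4)
```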