PLay: Parametrically Conditioned Layout Generation using Latent Diffusion
Authors: Chin-Yi Cheng, Forrest Huang, Gang Li, Yang Li
ICML 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our method outperforms prior works across three datasets on metrics including FID and FD-VG, and in a user study. |
| Researcher Affiliation | Industry | Google Research, Mountain View, United States. Correspondence to: Chin-Yi Cheng <cchinyi@google.com>, Yang Li <liyang@google.com>. |
| Pseudocode | No | No explicit pseudocode or algorithm blocks are provided. The model components are described visually in Figure 14 and processes are explained in text. |
| Open Source Code | No | The paper does not contain an explicit statement that the authors are releasing their code for the described methodology, nor does it provide a link to a code repository. |
| Open Datasets | Yes | We experiment PLay with three publicly available datasets for two different domains: UI and document layouts. CLAY (Li et al., 2022) contains about 50K UI layouts with 24 classes. RICO-Semantic (Liu et al., 2018) contains about 43K UI layouts with 13 classes previously used in VTN. PubLayNet (Zhong et al., 2019) contains about 330K document layouts with 5 classes. |
| Dataset Splits | No | The paper mentions using the CLAY, RICO-Semantic, and PubLayNet datasets but does not explicitly provide percentages or counts for training, validation, or test splits. It only mentions 'sample size s = 1024' for metric computation, which is not a dataset split. |
| Hardware Specification | Yes | The model is trained using 8 Google Cloud TPU v4 cores for 47 hours. |
| Software Dependencies | No | We implemented the proposed architecture in JAX and Flax. JAX and Flax are named, but no library versions or a complete dependency list are provided. |
| Experiment Setup | Yes | We use ADAM optimizer (b1 = 0.9, b2 = 0.98) with 500k steps and a batch size of 128. The learning rate is 0.001 with linear warm-up for the first 8k steps. For the denoise network ϵθ(zt, τψ(G), t), we use a Transformer encoder to replace the U-Net structure used in image-based DMs and predict the noise ϵ. We also added a small KL-penalty to regularize the latent space while keeping high reconstruction accuracy. In sampling, we use DDPM and CFG with w = 1.5. We also found that discrete coordinate values work better empirically and set the dimensions of each layout to width = 36 and height = 64. We fix the maximum number of elements per layout at N = 128; layouts with fewer elements are padded to the same size, which results in fixed N and D for all layouts. We fix the maximum number of guidelines per layout at M = 128. (Hedged code sketches of these reported settings follow the table.) |
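
The optimizer settings quoted above (Adam with b1 = 0.9, b2 = 0.98, learning rate 0.001, linear warm-up over 8k steps, 500k total steps) can be expressed as a short configuration sketch. The paper only names JAX and Flax; the use of optax here, and the constant-after-warm-up schedule, are assumptions for illustration, not the authors' released code.

```python
# Minimal sketch of the reported optimizer configuration, assuming optax
# (optax is NOT named in the paper; only JAX and Flax are mentioned).
import optax

WARMUP_STEPS = 8_000    # "linear warm-up for the first 8k steps"
TOTAL_STEPS = 500_000   # "500k steps"
PEAK_LR = 1e-3          # "learning rate is 0.001"

# Assumption: linear warm-up from 0 to the peak rate, then held constant.
schedule = optax.join_schedules(
    schedules=[
        optax.linear_schedule(init_value=0.0, end_value=PEAK_LR,
                              transition_steps=WARMUP_STEPS),
        optax.constant_schedule(PEAK_LR),
    ],
    boundaries=[WARMUP_STEPS],
)

# Adam with the betas quoted in the paper (b1 = 0.9, b2 = 0.98).
optimizer = optax.adam(learning_rate=schedule, b1=0.9, b2=0.98)
```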
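The table also reports sampling with DDPM and classifier-free guidance (CFG) at w = 1.5. The sketch below shows one common way to combine conditional and unconditional noise predictions; the exact guidance-weight convention varies between papers and is not specified here, and `denoise_fn`, `cond`, and `null_cond` are hypothetical names.

```python
def cfg_noise_estimate(denoise_fn, z_t, t, cond, null_cond, w=1.5):
    """Classifier-free guidance combination of noise predictions.

    Uses the convention eps = (1 + w) * eps_cond - w * eps_uncond;
    the paper's precise convention for w is not stated in this report.
    """
    eps_cond = denoise_fn(z_t, cond, t)         # conditioned on guidelines G
    eps_uncond = denoise_fn(z_t, null_cond, t)  # guidance condition dropped
    return eps_uncond + (1.0 + w) * (eps_cond - eps_uncond)
```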
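Finally, layouts are padded to a fixed N = 128 elements with discrete coordinates on a 36 x 64 grid. A small illustration of that padding, with hypothetical array shapes and names, might look as follows.

```python
import jax.numpy as jnp

MAX_ELEMENTS = 128       # N, maximum elements per layout
GRID_W, GRID_H = 36, 64  # discretized layout width and height

def pad_layout(elements: jnp.ndarray) -> tuple[jnp.ndarray, jnp.ndarray]:
    """Pad an (n, D) array of element features to (MAX_ELEMENTS, D).

    Returns the padded array and a boolean mask marking real elements,
    so every layout has fixed N and D as described in the paper.
    """
    n, d = elements.shape
    padded = jnp.zeros((MAX_ELEMENTS, d), dtype=elements.dtype)
    padded = padded.at[:n].set(elements)
    mask = jnp.arange(MAX_ELEMENTS) < n
    return padded, mask
```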