Complexity Matters: Rethinking the Latent Space for Generative Modeling
Authors: Tianyang Hu, Fei Chen, Haonan Wang, Jiawei Li, Wenjia Wang, Jiacheng Sun, Zhenguo Li
NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our theoretical analyses are corroborated by comprehensive experiments on various models such as VQGAN [21] and Diffusion Transformer [60], where our modifications yield significant improvements in sample quality with decreased model complexity. |
| Researcher Affiliation | Collaboration | 1 Huawei Noah's Ark Lab, 2 National University of Singapore, 3 Hong Kong University of Science and Technology (Guangzhou) |
| Pseudocode | No | Figure 1 illustrates the method overview, where the two stages of DAE can be summarized as: DAE Stage 1: Train the encoder f ∈ F with a small auxiliary decoder g_aux ∈ G_A to learn a good latent. DAE Stage 2: Freeze the trained encoder f, then train the regular decoder g ∈ G to ensure good generation performance. (A minimal training-loop sketch based on this description follows the table.) |
| Open Source Code | No | We use the official VQGAN implementation and model architectures for FacesHQ. The paper references third-party implementations but does not explicitly state that their own code for the methodology is available. |
| Open Datasets | Yes | We conduct empirical evaluations of our proposed DAE training scheme on a variety of datasets and generative models, from toy Gaussian mixture data to DCGAN on CIFAR-10, to VQGAN and DiT on larger datasets. ... FacesHQ dataset, which is a combination of two face datasets CelebA-HQ [51] and FFHQ [43]... ImageNet dataset [18]... Open Images [46] dataset |
| Dataset Splits | Yes | We evaluate our DAE modifications to VQGAN on the FacesHQ dataset, which is a combination of two face datasets CelebA-HQ and FFHQ, with 85k training images and 15k validation images in total (Table 5). |
| Hardware Specification | Yes | All experiments are run on eight V100 GPUs. |
| Software Dependencies | No | We use the official VQGAN implementation and model architectures for FacesHQ. ... Following the same setup as the official implementation, the EMA rate is 0.9999 and the classifier-free guidance scale is 4. The paper mentions implementations but not specific software versions for reproducibility. |
| Experiment Setup | Yes | For training the encoder and decoder, the learning rate is 4.5 × 10⁻⁶, the batch size is 8 on each GPU (total batch size 64), and the number of training epochs is 80. For training the transformer, the learning rate is 2 × 10⁻⁶ and the batch size is 12 on each GPU. ... For the DAE training, we jointly train encoder f and the auxiliary decoder g_aux ∈ G_A in the first stage (first 40 epochs). Then in the second stage (last 40 epochs), we replace g_aux with g, and train g from scratch with f (and the trained codebook) fixed. ... AdamW [53] optimizer is employed with a constant learning rate of 10⁻⁴ and a weight decay of 3 × 10⁻². The batch size is 1024, and the number of epochs is 120. (An optimizer sketch for these settings also follows the table.) |
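
The two-stage DAE procedure quoted in the Pseudocode and Experiment Setup rows can be made concrete with a short PyTorch sketch. This is a minimal illustration under stated assumptions, not the authors' released code: the encoder/decoder modules, the MSE reconstruction loss, the dummy data, and the choice of the Adam optimizer are hypothetical placeholders; only the stage structure (joint encoder plus small auxiliary decoder for the first 40 epochs, then a frozen encoder with a regular decoder trained from scratch for the last 40) and the 4.5 × 10⁻⁶ learning rate come from the quoted setup. VQ quantization and the codebook are omitted for brevity.

```python
import torch
import torch.nn as nn

# Stand-in modules (hypothetical): a conv encoder f, a deliberately small
# auxiliary decoder g_aux (class G_A), and a regular-capacity decoder g (class G).
encoder = nn.Sequential(nn.Conv2d(3, 16, 4, 2, 1), nn.ReLU(), nn.Conv2d(16, 8, 4, 2, 1))
aux_decoder = nn.Sequential(nn.ConvTranspose2d(8, 3, 4, 4))  # tiny decoder
decoder = nn.Sequential(nn.ConvTranspose2d(8, 16, 4, 2, 1), nn.ReLU(),
                        nn.ConvTranspose2d(16, 3, 4, 2, 1))
recon_loss = nn.MSELoss()
train_loader = [torch.randn(8, 3, 32, 32) for _ in range(10)]  # dummy data

# Stage 1: jointly train encoder f and the small auxiliary decoder g_aux,
# so the learned latent is simple enough to be decoded by a low-complexity model.
opt1 = torch.optim.Adam(list(encoder.parameters()) + list(aux_decoder.parameters()),
                        lr=4.5e-6)
for epoch in range(40):  # first 40 of the 80 reported epochs
    for x in train_loader:
        loss = recon_loss(aux_decoder(encoder(x)), x)
        opt1.zero_grad(); loss.backward(); opt1.step()

# Stage 2: freeze the trained encoder (and the codebook, in the VQ case) and
# train the regular decoder g from scratch on the fixed latents.
for p in encoder.parameters():
    p.requires_grad_(False)
opt2 = torch.optim.Adam(decoder.parameters(), lr=4.5e-6)
for epoch in range(40):  # last 40 epochs
    for x in train_loader:
        with torch.no_grad():
            z = encoder(x)
        loss = recon_loss(decoder(z), x)
        opt2.zero_grad(); loss.backward(); opt2.step()
```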
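For the DiT experiments, the reported optimizer settings map directly onto a few lines of code. The model below is a placeholder standing in for the DiT backbone; only the AdamW hyperparameters (constant learning rate 10⁻⁴, weight decay 3 × 10⁻²) are taken from the quoted setup, which also reports a batch size of 1024 and 120 epochs.

```python
import torch
import torch.nn as nn

model = nn.Linear(768, 768)  # placeholder for the DiT backbone
# AdamW with a constant learning rate of 1e-4 and weight decay of 3e-2,
# as reported for the DiT experiments.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=3e-2)
```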