Complexity Matters: Rethinking the Latent Space for Generative Modeling

Authors: Tianyang Hu, Fei Chen, Haonan Wang, Jiawei Li, Wenjia Wang, Jiacheng Sun, Zhenguo Li

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Our theoretical analyses are corroborated by comprehensive experiments on various models such as VQGAN [21] and Diffusion Transformer [60], where our modifications yield significant improvements in sample quality with decreased model complexity."
Researcher Affiliation | Collaboration | 1 Huawei Noah's Ark Lab, 2 National University of Singapore, 3 Hong Kong University of Science and Technology (Guangzhou)
Pseudocode | No | Figure 1 illustrates the method overview, where the two stages of DAE can be summarized as: DAE Stage 1: Train the encoder f ∈ F with a small auxiliary decoder g_aux ∈ G_A to learn a good latent. DAE Stage 2: Freeze the trained encoder f, then train the regular decoder g ∈ G to ensure good generation performance. (A hedged training-loop sketch of these two stages follows the table.)
Open Source Code | No | "We use the official VQGAN implementation and model architectures for FacesHQ." The paper references third-party implementations but does not explicitly state that the authors' own code for the methodology is available.
Open Datasets | Yes | "We conduct empirical evaluations of our proposed DAE training scheme on a variety of datasets and generative models, from toy Gaussian mixture data to DCGAN on CIFAR-10, to VQGAN and DiT on larger datasets. ... FacesHQ dataset, which is a combination of two face datasets CelebA-HQ [51] and FFHQ [43] ... ImageNet dataset [18] ... Open Images [46] dataset"
Dataset Splits | Yes | "We evaluate our DAE modifications to VQGAN on the FacesHQ dataset, which is a combination of two face datasets CelebA-HQ and FFHQ, with 85k training images and 15k validation images in total (Table 5)."
Hardware Specification | Yes | "All experiments are run on eight V100 GPUs."
Software Dependencies | No | "We use the official VQGAN implementation and model architectures for FacesHQ. ... Following the same setup as the official implementation, the EMA rate is 0.9999 and the classifier-free guidance scale is 4." The paper points to existing implementations but does not list specific software versions needed for reproducibility.
Experiment Setup | Yes | "For training the encoder and decoder, the learning rate is 4.5 × 10^-6, the batch size is 8 on each GPU (total batch size 64), and the number of training epochs is 80. For training the transformer, the learning rate is 2 × 10^-6 and the batch size is 12 on each GPU. ... For the DAE training, we jointly train the encoder f and the auxiliary decoder g_aux ∈ G_A in the first stage (first 40 epochs). Then in the second stage (last 40 epochs), we replace g_aux with g, and train g from scratch with f (and the trained codebook) fixed. ... The AdamW [53] optimizer is employed with a constant learning rate of 10^-4 and a weight decay of 3 × 10^-2. The batch size is 1024, and the number of epochs is 120."
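
The two-stage DAE schedule summarized in the Pseudocode row can be sketched as below. This is a minimal, hedged PyTorch-style illustration, not the authors' released code: the encoder, auxiliary decoder, regular decoder, reconstruction loss, and data loader are all toy placeholders standing in for the VQGAN-scale components described in the paper.

```python
import torch
from torch import nn, optim
from torch.utils.data import DataLoader, TensorDataset

# Toy stand-ins for the paper's encoder f, small auxiliary decoder g_aux (in G_A),
# and regular decoder g (in G); the real models are VQGAN-style networks.
encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 64))           # f
aux_decoder = nn.Sequential(nn.Linear(64, 3 * 32 * 32))                     # g_aux, low capacity
decoder = nn.Sequential(nn.Linear(64, 256), nn.ReLU(),
                        nn.Linear(256, 3 * 32 * 32))                        # g, higher capacity

recon_loss = nn.MSELoss()  # placeholder for the actual reconstruction objective
train_loader = DataLoader(TensorDataset(torch.randn(256, 3, 32, 32)), batch_size=8)

def flat(x):
    # Helper so the toy decoders compare against flattened images.
    return x.flatten(1)

# DAE Stage 1 (first 40 of 80 epochs): jointly train f and g_aux to learn a good latent.
opt1 = optim.Adam(list(encoder.parameters()) + list(aux_decoder.parameters()), lr=4.5e-6)
for epoch in range(40):
    for (x,) in train_loader:
        loss = recon_loss(aux_decoder(encoder(x)), flat(x))
        opt1.zero_grad(); loss.backward(); opt1.step()

# DAE Stage 2 (last 40 epochs): freeze the trained encoder f, train g from scratch.
for p in encoder.parameters():
    p.requires_grad_(False)
opt2 = optim.Adam(decoder.parameters(), lr=4.5e-6)
for epoch in range(40):
    for (x,) in train_loader:
        with torch.no_grad():
            z = encoder(x)
        loss = recon_loss(decoder(z), flat(x))
        opt2.zero_grad(); loss.backward(); opt2.step()
```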
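
The DiT-related settings quoted in the Experiment Setup and Software Dependencies rows (AdamW, constant learning rate 10^-4, weight decay 3 × 10^-2, EMA rate 0.9999, classifier-free guidance scale 4) map onto a standard PyTorch setup roughly as follows. The backbone and loss below are placeholders, and the paper's batch size (1024) and epoch count (120) are noted only in comments.

```python
import copy
import torch
from torch import nn, optim

# Hyperparameters quoted in the report; batch size 1024 and 120 epochs are
# assumed to apply to the full DiT training run, not this toy step.
LR, WEIGHT_DECAY, EMA_RATE, CFG_SCALE = 1e-4, 3e-2, 0.9999, 4.0

dit_model = nn.Linear(64, 64)          # placeholder for the actual DiT backbone
ema_model = copy.deepcopy(dit_model)   # EMA copy used for evaluation/sampling

optimizer = optim.AdamW(dit_model.parameters(), lr=LR, weight_decay=WEIGHT_DECAY)

@torch.no_grad()
def update_ema(ema, model, rate=EMA_RATE):
    # Standard exponential-moving-average parameter update.
    for p_ema, p in zip(ema.parameters(), model.parameters()):
        p_ema.mul_(rate).add_(p, alpha=1 - rate)

# One illustrative optimization step on dummy data.
x = torch.randn(16, 64)
loss = dit_model(x).pow(2).mean()      # placeholder for the diffusion training loss
optimizer.zero_grad(); loss.backward(); optimizer.step()
update_ema(ema_model, dit_model)
```

The classifier-free guidance scale (CFG_SCALE) is a sampling-time setting and is only declared here for completeness.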