Vector-quantized Image Modeling with Improved VQGAN
Authors: Jiahui Yu, Xin Li, Jing Yu Koh, Han Zhang, Ruoming Pang, James Qin, Alexander Ku, Yuanzhong Xu, Jason Baldridge, Yonghui Wu
ICLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The evaluations of our proposed ViT-VQGAN and VIM are studied from three aspects. (1) We evaluate the image quantizer based on reconstruction quality metrics including ℓ1 distance, ℓ2 distance, and log-Laplace distance, as well as Inception Score (IS) and Fréchet Inception Distance (FID) of reconstructed images (a minimal sketch of the reconstruction distances follows the table). (2) We evaluate the capabilities of the learned quantizer for unconditional or class-conditioned image synthesis based on FID and IS, and compare with other methods. (3) We rely on linear-probe accuracy to evaluate representations, with the common intuition that good features should linearly separate the classes of downstream tasks. |
| Researcher Affiliation | Industry | Jiahui Yu, Xin Li, Jing Yu Koh, Han Zhang, Ruoming Pang, James Qin, Alexander Ku, Yuanzhong Xu, Jason Baldridge, Yonghui Wu; Google Research; jiahuiyu@google.com |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper provides links to external projects (e.g., "https://github.com/CompVis/taming-transformers" and "https://github.com/openai/dall-e") which are related to datasets or prior work (VQGAN, DALL-E), but it does not provide a specific link or explicit statement about releasing the source code for the methodology described in this paper. |
| Open Datasets | Yes | We train the proposed ViT-VQGAN on three datasets separately, CelebA-HQ (Karras et al., 2019), FFHQ (Karras et al., 2019), and ImageNet (Krizhevsky et al., 2012). |
| Dataset Splits | Yes | For CelebA-HQ and FFHQ, we follow the default train and validation split as VQGAN (Esser et al., 2021). |
| Hardware Specification | Yes | Throughputs are benchmarked with the same 128 Cloud TPUv4 devices. [...] All models are trained with an input image resolution of 256×256 on Cloud TPUv4. |
| Software Dependencies | No | The paper mentions optimizers like "Adam optimizer (Kingma & Ba, 2014)" and "Adafactor (Shazeer & Stern, 2018)" but does not specify any software libraries (e.g., TensorFlow, PyTorch) or their version numbers that would be necessary to replicate the experiments. |
| Experiment Setup | Yes | We train all ViT-VQGAN models with a training batch size of 256 distributed across 128 Cloud TPUv4 for a total of 500,000 training steps. For both ViT-VQGAN and the StyleGAN discriminator, the Adam optimizer (Kingma & Ba, 2014) is used with β1 = 0.9 and β2 = 0.99, with the learning rate linearly warming up to a peak value of 1 × 10⁻⁴ over 50,000 steps and then decaying to 5 × 10⁻⁵ over the remaining 450,000 steps with a cosine schedule (a sketch of this schedule follows the table). We use a decoupled weight decay (Loshchilov & Hutter, 2017) of 1 × 10⁻⁴ for both ViT-VQGAN and the StyleGAN discriminator. [...] Models are trained with a global training batch size of 1024 for a total of 450,000 training steps. We use the Adam optimizer (Kingma & Ba, 2014) with β1 = 0.9 and β2 = 0.96, with the learning rate linearly warming up to a peak constant value of 4.5 × 10⁻⁴ over the first 5,000 steps and then exponentially decaying to 1 × 10⁻⁵ starting from 80,000 steps. [...] A dropout ratio of 0.1 is used in all residual, activation and attention outputs. |
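
The reconstruction metrics quoted in the Research Type row (ℓ1 distance, ℓ2 distance, log-Laplace distance) can be sketched directly; FID and IS additionally require an Inception network and are normally taken from reference implementations, so they are omitted here. The snippet below is a minimal NumPy interpretation, not the authors' code: the function names, the fixed Laplace scale, and the exact form of the log-Laplace term are assumptions, since the paper does not spell them out.

```python
import numpy as np

def l1_distance(x, x_rec):
    """Mean absolute pixel difference between original and reconstruction."""
    return np.abs(x - x_rec).mean()

def l2_distance(x, x_rec):
    """Mean squared pixel difference between original and reconstruction."""
    return ((x - x_rec) ** 2).mean()

def laplace_nll(x, x_rec, scale=0.1):
    """Hypothetical reading of the 'log-Laplace distance': mean negative
    log-likelihood of the original pixels under a Laplace distribution
    centred at the reconstruction with a fixed scale (an assumption)."""
    return (np.log(2.0 * scale) + np.abs(x - x_rec) / scale).mean()

# Toy usage: a random 256x256 RGB image in [0, 1] and a noisy "reconstruction".
rng = np.random.default_rng(0)
x = rng.random((256, 256, 3))
x_rec = np.clip(x + 0.01 * rng.standard_normal(x.shape), 0.0, 1.0)
print(l1_distance(x, x_rec), l2_distance(x, x_rec), laplace_nll(x, x_rec))
```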
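
The Experiment Setup row describes a learning rate that warms up linearly to a peak of 1 × 10⁻⁴ over 50,000 steps and then follows a cosine decay to 5 × 10⁻⁵ over the remaining 450,000 steps. Below is a minimal sketch of such a schedule, assuming a standard warmup-plus-cosine formulation; the function name `warmup_cosine_lr` and the probe points are illustrative, not the authors' implementation.

```python
import math

def warmup_cosine_lr(step, peak_lr=1e-4, end_lr=5e-5,
                     warmup_steps=50_000, total_steps=500_000):
    """Linear warmup to peak_lr, then cosine decay to end_lr over the
    remaining steps (values taken from the quoted ViT-VQGAN setup)."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return end_lr + (peak_lr - end_lr) * 0.5 * (1.0 + math.cos(math.pi * progress))

# A few probe points: start, mid-warmup, peak, mid-decay, end of training.
for s in (0, 25_000, 50_000, 275_000, 500_000):
    print(s, warmup_cosine_lr(s))
```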