Vector-quantized Image Modeling with Improved VQGAN

Authors: Jiahui Yu, Xin Li, Jing Yu Koh, Han Zhang, Ruoming Pang, James Qin, Alexander Ku, Yuanzhong Xu, Jason Baldridge, Yonghui Wu

ICLR 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | The evaluations of our proposed ViT-VQGAN and VIM are studied with three aspects. (1) We evaluate the image quantizer based on reconstruction quality metrics including ℓ1 distance, ℓ2 distance, log-Laplace distance, as well as Inception Score (IS) and Fréchet Inception Distance (FID) of reconstructed images. (2) We evaluate the capabilities of the learned quantizer for unconditional or class-conditioned image synthesis based on FID and IS, and compare with other methods. (3) We rely on linear-probe accuracy to evaluate representations with the common intuition that good features should linearly separate the classes of downstream tasks. (A sketch of the ℓ1/ℓ2 distance computation follows the table.)
Researcher Affiliation | Industry | Jiahui Yu, Xin Li, Jing Yu Koh, Han Zhang, Ruoming Pang, James Qin, Alexander Ku, Yuanzhong Xu, Jason Baldridge, Yonghui Wu; Google Research; jiahuiyu@google.com
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | No | The paper provides links to external projects (e.g., https://github.com/CompVis/taming-transformers and https://github.com/openai/dall-e) that relate to prior work (VQGAN, DALL-E), but it does not provide a specific link or an explicit statement about releasing the source code for the methodology described in this paper.
Open Datasets | Yes | We train the proposed ViT-VQGAN on three datasets separately, CelebA-HQ (Karras et al., 2019), FFHQ (Karras et al., 2019), and ImageNet (Krizhevsky et al., 2012).
Dataset Splits | Yes | For CelebA-HQ and FFHQ, we follow the default train and validation split as VQGAN (Esser et al., 2021).
Hardware Specification | Yes | Throughputs are benchmarked with the same 128 Cloud TPUv4 devices. [...] All models are trained with an input image resolution of 256 × 256 on Cloud TPUv4.
Software Dependencies | No | The paper mentions optimizers such as the "Adam optimizer (Kingma & Ba, 2014)" and "Adafactor (Shazeer & Stern, 2018)" but does not specify any software libraries (e.g., TensorFlow, PyTorch) or their version numbers that would be necessary to replicate the experiments.
Experiment Setup | Yes | We train all ViT-VQGAN models with a training batch size of 256 distributed across 128 Cloud TPUv4 for a total of 500,000 training steps. For both ViT-VQGAN and the StyleGAN discriminator, the Adam optimizer (Kingma & Ba, 2014) is used with β1 = 0.9 and β2 = 0.99, with the learning rate linearly warming up to a peak value of 1 × 10^-4 over 50,000 steps and then decaying to 5 × 10^-5 over the remaining 450,000 steps with a cosine schedule. We use a decoupled weight decay (Loshchilov & Hutter, 2017) of 1 × 10^-4 for both ViT-VQGAN and the StyleGAN discriminator. [...] Models are trained with a global training batch size of 1024 for a total of 450,000 training steps. We use the Adam optimizer (Kingma & Ba, 2014) with β1 = 0.9 and β2 = 0.96, with the learning rate linearly warming up to a peak constant value of 4.5 × 10^-4 over the first 5,000 steps and then exponentially decaying to 1 × 10^-5 starting from 80,000 steps. [...] A dropout ratio of 0.1 is used in all residual, activation and attention outputs. (A sketch of both learning-rate schedules follows the table.)
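
The reconstruction quality metrics quoted under Research Type (ℓ1 distance, ℓ2 distance, log-Laplace distance, IS, FID) are named but not defined in the excerpt. Below is a minimal sketch, assuming images normalized to [0, 1] and per-image averaging, of how the ℓ1 and ℓ2 distances could be computed; the function name, shapes, and normalization are illustrative rather than the paper's exact formulation, and IS/FID (as well as the log-Laplace term) are omitted because they depend on choices the excerpt does not pin down.

    import numpy as np

    def reconstruction_distances(x, x_hat):
        # x, x_hat: float arrays in [0, 1], shape (batch, height, width, channels).
        # Returns batch-mean l1 and l2 reconstruction distances; the exact
        # normalization used in the paper is not stated, so this is illustrative.
        diff = x - x_hat
        l1 = np.abs(diff).mean(axis=(1, 2, 3))            # mean absolute error per image
        l2 = np.sqrt((diff ** 2).mean(axis=(1, 2, 3)))    # root-mean-square error per image
        return {"l1": float(l1.mean()), "l2": float(l2.mean())}

    # Toy usage with random tensors standing in for 256 x 256 RGB reconstructions.
    x = np.random.rand(4, 256, 256, 3)
    x_hat = np.clip(x + 0.05 * np.random.randn(*x.shape), 0.0, 1.0)
    print(reconstruction_distances(x, x_hat))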
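
The Experiment Setup row describes two learning-rate schedules: linear warmup to 1 × 10^-4 followed by cosine decay to 5 × 10^-5 for the Stage-1 ViT-VQGAN (500,000 steps), and linear warmup to a constant 4.5 × 10^-4 followed by exponential decay to 1 × 10^-5 from step 80,000 for the Stage-2 Transformer (450,000 steps). A minimal sketch of both schedules is shown below; the constant plateau between warmup and step 80,000 and the assumption that the exponential decay reaches its floor exactly at the final step are a reading of the quoted text, not details confirmed by the paper.

    import math

    def stage1_lr(step, peak=1e-4, final=5e-5, warmup=50_000, total=500_000):
        # Linear warmup to `peak`, then cosine decay to `final` over the remaining steps.
        if step < warmup:
            return peak * step / warmup
        progress = (step - warmup) / (total - warmup)   # 0 -> 1 over the decay phase
        return final + 0.5 * (peak - final) * (1.0 + math.cos(math.pi * progress))

    def stage2_lr(step, peak=4.5e-4, final=1e-5, warmup=5_000, decay_start=80_000, total=450_000):
        # Linear warmup to a constant `peak`, then exponential decay to `final`.
        # The plateau until `decay_start` and the decay endpoint at `total` are assumptions.
        if step < warmup:
            return peak * step / warmup
        if step < decay_start:
            return peak
        progress = (step - decay_start) / (total - decay_start)
        return peak * (final / peak) ** progress        # equals `final` at step `total`

    print(stage1_lr(50_000), stage1_lr(500_000))    # 1e-4 at the peak, 5e-5 at the end
    print(stage2_lr(5_000), stage2_lr(450_000))     # 4.5e-4 at the peak, 1e-5 at the end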