Finite Scalar Quantization: VQ-VAE Made Simple

Authors: Fabian Mentzer, David Minnen, Eirikur Agustsson, Michael Tschannen

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Despite the much simpler design of FSQ, we obtain competitive performance in all these tasks. We emphasize that FSQ does not suffer from codebook collapse and does not need the complex machinery employed in VQ (commitment losses, codebook reseeding, code splitting, entropy penalties, etc.) to learn expressive discrete representations. We start with a study, where we train MaskGIT models on lower resolution 128x128 ImageNet images and for shorter time compared to the paper Chang et al. (2022).
Researcher Affiliation | Industry | Fabian Mentzer¹, David Minnen¹, Eirikur Agustsson¹, Michael Tschannen²; ¹Google Research, ²Google DeepMind
Pseudocode | No | The paper refers to 'code in App. A.1' for a specific function, but it does not include any clearly labeled 'Pseudocode' or 'Algorithm' blocks within the document (a sketch of that quantizer appears after this table).
Open Source Code | Yes | Colab on GitHub. We refer to Section A.1 for reference code.
Open Datasets | Yes | We start with a study, where we train MaskGIT models on lower resolution 128x128 ImageNet images and for shorter time compared to the paper Chang et al. (2022) (100 epochs for Stage I, 200 epochs for Stage II. Please see Appendix A.4.1 for more hyperparameters). We train MaskGIT models on ImageNet 256 based on the public GitHub code, training Stage I for 1M steps with batch size 512, and Stage II for 2.5M steps with batch size 256. We retrain the public UViM GitHub code for all three tasks (panoptic segmentation, depth estimation, colorization).
Dataset Splits | Yes | Reconstruction FID, the FID obtained by the GAN-trained autoencoder when the 50k validation images are fed through the quantized autoencoder.
Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU/CPU models, memory, or number of machines) used to run the experiments.
Software Dependencies | No | The paper mentions the 'ADM TensorFlow Suite' and 'JAX' in its references, and 'public GitHub code' for MaskGIT and UViM, but it does not specify concrete version numbers for any key software components or libraries (e.g., TensorFlow version, PyTorch version, Python version).
Experiment Setup | Yes | We start with a study, where we train MaskGIT models on lower resolution 128x128 ImageNet images and for shorter time compared to the paper Chang et al. (2022) (100 epochs for Stage I, 200 epochs for Stage II. Please see Appendix A.4.1 for more hyperparameters).
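To make the Pseudocode and Open Source Code rows concrete: the quantizer FSQ proposes is small enough to sketch in a few lines. Below is a minimal JAX sketch in the spirit of the reference code the paper points to in App. A.1, not the authors' verbatim listing. The `levels` argument and the `codes_to_indices` helper are written here for illustration, and the paper's reference code additionally shifts the tanh input to keep even-level channels symmetric, which this sketch omits for brevity.

```python
# Minimal FSQ sketch (a paraphrase, not the paper's verbatim App. A.1 listing).
import jax
import jax.numpy as jnp


def round_ste(z):
    """Round to the nearest integer, passing gradients straight through."""
    zhat = jnp.round(z)
    return z + jax.lax.stop_gradient(zhat - z)


def fsq_quantize(z, levels):
    """Quantize each of the d channels of z (shape (..., d)) to levels[i] values.

    The implicit codebook has prod(levels) entries; nothing is learned and no
    auxiliary losses (commitment, reseeding, entropy penalties) are required.
    """
    levels = jnp.asarray(levels)
    half_l = (levels - 1) / 2.0
    # Shift even-level channels by 0.5 so rounding yields exactly levels[i]
    # distinct integers per channel.
    offset = jnp.where(levels % 2 == 0, 0.5, 0.0)
    bounded = jnp.tanh(z) * half_l - offset
    quantized = round_ste(bounded)
    # Renormalize so the decoder sees values in roughly [-1, 1].
    return quantized / (levels // 2)


def codes_to_indices(zhat, levels):
    """Map quantized codes to integer ids in [0, prod(levels)) (mixed radix)."""
    levels = jnp.asarray(levels)
    half_width = levels // 2
    digits = jnp.round(zhat * half_width + half_width)  # per-channel digit
    basis = jnp.cumprod(jnp.concatenate([jnp.array([1]), levels[:-1]]))
    return jnp.sum(digits * basis, axis=-1).astype(jnp.int32)


# Example: a batch of 16x16 latents with 4 channels and levels (8, 5, 5, 5),
# i.e. an implicit codebook of 8*5*5*5 = 1000 codes.
z = jax.random.normal(jax.random.PRNGKey(0), (2, 16, 16, 4))
zhat = fsq_quantize(z, (8, 5, 5, 5))
ids = codes_to_indices(zhat, (8, 5, 5, 5))  # shape (2, 16, 16), ids in [0, 1000)
```

Note how this matches the Research Type row: the codebook is fixed by the choice of per-channel levels rather than learned, so there is nothing to collapse and none of the VQ machinery (commitment losses, codebook reseeding, code splitting, entropy penalties) is needed.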