Finite Scalar Quantization: VQ-VAE Made Simple
Authors: Fabian Mentzer, David Minnen, Eirikur Agustsson, Michael Tschannen
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Despite the much simpler design of FSQ, we obtain competitive performance in all these tasks. We emphasize that FSQ does not suffer from codebook collapse and does not need the complex machinery employed in VQ (commitment losses, codebook reseeding, code splitting, entropy penalties, etc.) to learn expressive discrete representations. We start with a study, where we train MaskGIT models on lower resolution 128x128 ImageNet images and for shorter time compared to the paper Chang et al. (2022). |
| Researcher Affiliation | Industry | Fabian Mentzer1, David Minnen1, Eirikur Agustsson1, Michael Tschannen2, 1Google Research 2Google DeepMind |
| Pseudocode | No | The paper refers to 'code in App. A.1' for a specific function, but it does not include any clearly labeled 'Pseudocode' or 'Algorithm' blocks within the document. |
| Open Source Code | Yes | Colab on GitHub. We refer to Section A.1 for reference code. |
| Open Datasets | Yes | We start with a study, where we train MaskGIT models on lower resolution 128x128 ImageNet images and for shorter time compared to the paper Chang et al. (2022) (100 epochs for Stage I, 200 epochs for Stage II. Please see Appendix A.4.1 for more hyperparameters). We train MaskGIT models on ImageNet 256 based on the public GitHub code, training Stage I for 1M steps with batch size 512, and Stage II for 2.5M steps with batch size 256. We retrain the public UViM GitHub code for all three tasks (panoptic segmentation, depth estimation, colorization). |
| Dataset Splits | Yes | Reconstruction FID, the FID obtained by the GAN-trained autoencoder when the 50k validation images are fed through the quantized autoencoder. |
| Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU/CPU models, memory, or number of machines) used to run the experiments. |
| Software Dependencies | No | The paper mentions the use of the 'ADM TensorFlow Suite' and 'JAX' in its references, and 'public GitHub code' for MaskGIT and UViM, but it does not specify concrete version numbers for any key software components or libraries (e.g., TensorFlow version, PyTorch version, Python version). |
| Experiment Setup | Yes | We start with a study, where we train MaskGIT models on lower resolution 128x128 ImageNet images and for shorter time compared to the paper Chang et al. (2022) (100 epochs for Stage I, 200 epochs for Stage II. Please see Appendix A.4.1 for more hyperparameters). |
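For context on why FSQ needs none of the VQ machinery listed above: each latent channel is simply bounded and rounded to a small fixed grid of levels, so the "codebook" is implicit and can never collapse. Below is a minimal sketch of that quantization step, assuming NumPy; the function names are mine, not from the paper's reference code, and the bounding follows the tanh-based scheme described in the paper.

```python
import numpy as np

def fsq_quantize(z, levels):
    """Finite Scalar Quantization: bound each channel, then round.

    z       : array whose last dimension has one entry per channel
    levels  : per-channel level counts, e.g. [8, 5, 5, 5]
    (Names and exact bounding are a sketch, not the paper's reference code.)
    """
    half = (np.asarray(levels, dtype=np.float64) - 1) / 2
    bounded = half * np.tanh(z)   # each channel now lies in (-half_i, half_i)
    return np.round(bounded)      # nearest grid point = the discrete code

def fsq_codebook_size(levels):
    """Implicit codebook size is just the product of the level counts."""
    return int(np.prod(levels))
```

During training the rounding would be made differentiable with a straight-through estimator, e.g. `bounded + stop_gradient(round(bounded) - bounded)`; since the grid is fixed, no commitment loss, reseeding, or entropy penalty is needed. With `levels = [8, 5, 5, 5]` the implicit codebook has 8 * 5 * 5 * 5 = 1000 entries.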