LiteVAE: Lightweight and Efficient Variational Autoencoders for Latent Diffusion Models
Authors: Seyedmorteza Sadat, Jakob Buhmann, Derek Bradley, Otmar Hilliges, Romann M. Weber
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through extensive experimentation, we show that LiteVAE considerably reduces the computational cost of the standard VAE encoder while maintaining the same level of reconstruction quality. Our base LiteVAE model matches the quality of the established VAEs in current LDMs with a six-fold reduction in encoder parameters, leading to faster training and lower GPU memory requirements, while our larger model outperforms VAEs of comparable complexity across all evaluated metrics (rFID, LPIPS, PSNR, and SSIM). |
| Researcher Affiliation | Collaboration | ¹ETH Zürich, ²DisneyResearch\|Studios |
| Pseudocode | Yes | Appendix G: Pseudocode for different LiteVAE blocks (an illustrative sketch of the block design appears after this table) |
| Open Source Code | No | While we do not provide open access to our codebase, the hyperparameters, algorithms, and implementation details are provided in the appendix to ensure reproducibility. |
| Open Datasets | Yes | FFHQ [30] (256×256), ImageNet [57] |
| Dataset Splits | No | The paper uses standard public datasets like FFHQ and ImageNet but does not explicitly detail training, validation, and test splits with percentages or sample counts. |
| Hardware Specification | Yes | The values are measured on one Quadro RTX 6000. |
| Software Dependencies | No | Our implementation of the UNet used for feature extraction and aggregation closely follows the ADM model [10] without spatial down/upsampling layers. We use Adam optimizer [34] with a learning rate of 10⁻⁴ and (β1, β2) = (0.5, 0.9). (Specific version numbers for software libraries are not provided.) |
| Experiment Setup | Yes | All models were trained with a batch size of 16 on two GPUs until the autoencoder could produce high-quality reconstructions. The training duration was 200k steps for the ImageNet 128×128 models, and 100k for the ImageNet 256×256 and FFHQ models. We use Adam optimizer [34] with a learning rate of 10⁻⁴ and (β1, β2) = (0.5, 0.9). (A minimal training-setup sketch appears after this table.) |
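The paper's Appendix G contains the authors' pseudocode for the LiteVAE blocks; we do not reproduce it here. As a rough, non-authoritative illustration of the design the paper describes (a multi-level wavelet transform feeding lightweight feature-extraction and aggregation networks), below is a minimal PyTorch sketch. Everything in it is an assumption for illustration: `WaveletFeatureEncoder`, the plain convolution stacks standing in for the paper's feature-extraction and aggregation UNets, and all channel and level counts are hypothetical, not the authors' code.

```python
import torch
from torch import nn

def haar_dwt(x: torch.Tensor):
    """Single-level 2D Haar DWT on an NCHW tensor.

    Slices the even/odd rows and columns and forms the four orthonormal
    Haar combinations. Returns (LL, LH, HL, HH), each at half the
    spatial resolution of the input.
    """
    a = x[..., 0::2, 0::2]
    b = x[..., 0::2, 1::2]
    c = x[..., 1::2, 0::2]
    d = x[..., 1::2, 1::2]
    ll = (a + b + c + d) / 2
    lh = (a + b - c - d) / 2
    hl = (a - b + c - d) / 2
    hh = (a - b - c + d) / 2
    return ll, lh, hl, hh

class WaveletFeatureEncoder(nn.Module):
    """Toy stand-in for a LiteVAE-style encoder: multi-level Haar DWT,
    a small conv stack per level in place of the feature-extraction UNets,
    and a conv layer in place of the aggregation UNet."""

    def __init__(self, in_ch=3, feat_ch=32, latent_ch=4, levels=3):
        super().__init__()
        self.levels = levels
        # One lightweight extractor per DWT level (4 subbands x in_ch channels).
        self.extractors = nn.ModuleList(
            nn.Sequential(
                nn.Conv2d(4 * in_ch, feat_ch, 3, padding=1),
                nn.SiLU(),
                nn.Conv2d(feat_ch, feat_ch, 3, padding=1),
            )
            for _ in range(levels)
        )
        self.aggregator = nn.Conv2d(levels * feat_ch, 2 * latent_ch, 3, padding=1)

    def forward(self, x):
        feats, cur = [], x
        for lvl in range(self.levels):
            ll, lh, hl, hh = haar_dwt(cur)
            bands = torch.cat([ll, lh, hl, hh], dim=1)
            f = self.extractors[lvl](bands)
            # Pool every level's features to the deepest level's resolution.
            scale = 2 ** (self.levels - 1 - lvl)
            if scale > 1:
                f = nn.functional.avg_pool2d(f, scale)
            feats.append(f)
            cur = ll  # recurse on the low-frequency band
        h = self.aggregator(torch.cat(feats, dim=1))
        mean, logvar = h.chunk(2, dim=1)  # Gaussian posterior parameters
        return mean, logvar
```

On a 256×256 input, three DWT levels put every pooled feature map at 32×32, matching the usual 8× spatial downsampling of LDM latents; the real model replaces the conv stacks with the ADM-style UNets without down/upsampling layers noted in the table above.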
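The reported optimizer and schedule translate almost directly into code. The snippet below is a minimal sketch assuming PyTorch; the single-conv model, random batches, and MSE loss are placeholders for the actual LiteVAE model, data pipeline, and training objective. Only the Adam hyperparameters, batch size, and step counts come from the paper.

```python
import torch
from torch import nn, optim

# Hypothetical stand-in module; the real model is the LiteVAE autoencoder.
model = nn.Conv2d(3, 3, kernel_size=3, padding=1).cuda()

# Optimizer as reported: Adam, lr = 1e-4, (beta1, beta2) = (0.5, 0.9).
optimizer = optim.Adam(model.parameters(), lr=1e-4, betas=(0.5, 0.9))

batch_size = 16       # as reported (split across two GPUs in the paper)
num_steps = 200_000   # ImageNet 128x128; 100k for ImageNet 256x256 and FFHQ

for step in range(num_steps):
    # Random tensors stand in for a real ImageNet/FFHQ data loader.
    x = torch.randn(batch_size, 3, 128, 128, device="cuda")
    recon = model(x)
    # Plain MSE stands in for the paper's full reconstruction objective.
    loss = nn.functional.mse_loss(recon, x)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```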