LiteVAE: Lightweight and Efficient Variational Autoencoders for Latent Diffusion Models

Authors: Seyedmorteza Sadat, Jakob Buhmann, Derek Bradley, Otmar Hilliges, Romann M. Weber

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Through extensive experimentation, we show that LiteVAE considerably reduces the computational cost of the standard VAE encoder while maintaining the same level of reconstruction quality. Our base LiteVAE model matches the quality of the established VAEs in current LDMs with a six-fold reduction in encoder parameters, leading to faster training and lower GPU memory requirements, while our larger model outperforms VAEs of comparable complexity across all evaluated metrics (rFID, LPIPS, PSNR, and SSIM)."
Researcher Affiliation | Collaboration | ETH Zürich, Disney Research|Studios
Pseudocode | Yes | "Pseudocode for different LiteVAE blocks" (Appendix G)
Open Source Code | No | "While we do not provide open access to our codebase, the hyperparameters, algorithms, and implementation details are provided in the appendix to ensure reproducibility."
Open Datasets | Yes | FFHQ [30] (256×256), ImageNet [57]
Dataset Splits | No | The paper uses standard public datasets such as FFHQ and ImageNet but does not explicitly detail training, validation, and test splits with percentages or sample counts.
Hardware Specification | Yes | "The values are measured on one Quadro RTX 6000."
Software Dependencies | No | "Our implementation of the UNet used for feature extraction and aggregation closely follows the ADM model [10] without spatial down/upsampling layers. We use Adam optimizer [34] with a learning rate of 10⁻⁴ and (β₁, β₂) = (0.5, 0.9)." Specific version numbers for software libraries are not provided.
Experiment Setup | Yes | "All models were trained with a batch size of 16 on two GPUs until the autoencoder could produce high-quality reconstructions. The training duration was 200k steps for the ImageNet 128×128 models, and 100k steps for the ImageNet 256×256 and FFHQ models. We use Adam optimizer [34] with a learning rate of 10⁻⁴ and (β₁, β₂) = (0.5, 0.9)." (See the sketch below the table.)
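As a rough illustration of the training configuration quoted in the Software Dependencies and Experiment Setup rows, here is a minimal PyTorch sketch. The tiny convolutional autoencoder and the synthetic data batches are placeholders assumed for self-containedness (the paper's actual architecture and FFHQ/ImageNet dataloaders are not reproduced here); only the optimizer hyperparameters, batch size, and step counts come from the paper.

```python
import torch
import torch.nn as nn

# Placeholder autoencoder standing in for LiteVAE; the real blocks are
# specified in the paper's appendix (pseudocode in Appendix G).
model = nn.Sequential(
    nn.Conv2d(3, 12, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Conv2d(12, 3, kernel_size=3, padding=1),
)

# From the paper: Adam with learning rate 1e-4 and (beta1, beta2) = (0.5, 0.9).
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, betas=(0.5, 0.9))

BATCH_SIZE = 16        # from the paper: batch size 16 (spread over two GPUs)
TOTAL_STEPS = 200_000  # 200k steps for ImageNet 128x128;
                       # 100k for ImageNet 256x256 and FFHQ

for step in range(TOTAL_STEPS):
    # Synthetic stand-in for an FFHQ/ImageNet batch at 128x128 resolution.
    x = torch.randn(BATCH_SIZE, 3, 128, 128)
    recon = model(x)
    # Simple reconstruction loss as a stand-in; the paper's full objective
    # (perceptual and adversarial terms) is not reproduced here.
    loss = nn.functional.mse_loss(recon, x)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Note that β₁ = 0.5 (rather than the PyTorch default of 0.9) is the GAN-style momentum setting the paper reports.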