Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Epsilon-VAE: Denoising as Visual Decoding
Authors: Long Zhao, Sanghyun Woo, Ziyu Wan, Yandong Li, Han Zhang, Boqing Gong, Hartwig Adam, Xuhui Jia, Ting Liu
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate our approach by assessing both reconstruction (rFID) and generation quality (FID), comparing it to state-of-the-art autoencoding approaches. Our study systematically examines these components through controlled experiments, demonstrating their impact on achieving a high-performing diffusion-based autoencoder. In experiments under the standard configuration (Rombach et al., 2022), our method obtains a 40% improvement in terms of reconstruction quality, leading to 22% better image generation quality. More notably, we achieve 2.3× higher inference throughput by increasing compression rates, while keeping competitive generation quality. We evaluate the effectiveness of ϵ-VAE on image reconstruction and generation tasks using ImageNet (Deng et al., 2009). The VAE formulation by Esser et al. (2021) serves as a strong baseline due to its widespread use in modern image generative models (Rombach et al., 2022; Peebles & Xie, 2023; Esser et al., 2024). We perform controlled experiments to compare reconstruction and generation quality by varying model scale, latent dimension, downsampling rates, and input resolution. |
| Researcher Affiliation | Collaboration | Long Zhao¹, Sanghyun Woo¹, Ziyu Wan¹ ² *, Yandong Li¹, Han Zhang¹, Boqing Gong¹, Hartwig Adam¹, Xuhui Jia¹, Ting Liu¹. ¹Google DeepMind, ²City University of Hong Kong. Correspondence to: Long Zhao <EMAIL>, Xuhui Jia <EMAIL>, Ting Liu <EMAIL>. |
| Pseudocode | No | The paper describes methods using mathematical equations and textual descriptions but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper states: "All models are implemented in JAX/Flax (Bradbury et al., 2018; Heek et al., 2024) and trained on TPU-v5lite pods." However, it does not provide an explicit statement about releasing the code for their method or a link to a code repository. |
| Open Datasets | Yes | We evaluate the effectiveness of ϵ-VAE on image reconstruction and generation tasks using ImageNet (Deng et al., 2009). We evaluate rFID, PSNR and SSIM on the full validation sets of ImageNet and COCO-2017 (Lin et al., 2014). |
| Dataset Splits | Yes | We evaluate rFID, PSNR and SSIM on the full validation sets of ImageNet and COCO-2017 (Lin et al., 2014), with the results summarized in Tab. 2. |
| Hardware Specification | Yes | All models are implemented in JAX/Flax (Bradbury et al., 2018; Heek et al., 2024) and trained on TPU-v5lite pods. Inference throughputs are computed on a Tesla H100 GPU. |
| Software Dependencies | No | All models are implemented in JAX/Flax (Bradbury et al., 2018; Heek et al., 2024) and trained on TPU-v5lite pods. The paper mentions software frameworks but does not specify their version numbers. |
| Experiment Setup | Yes | The autoencoder loss follows Eq. 1, with weights set to λLPIPS = 0.5 and λadv = 0.5. We use the Adam optimizer (Kingma & Ba, 2015) with β1 = 0 and β2 = 0.999, applying a linear learning rate warmup over the first 5,000 steps, followed by a constant rate of 0.0001 for a total of one million steps. The batch size is 256, with data augmentations including random cropping and horizontal flipping. We follow the setting in Peebles & Xie (2023) to train the latent diffusion models for unconditional image generation on the ImageNet dataset. The DiT-XL/2 architecture is used for all experiments. The diffusion hyperparameters from ADM (Dhariwal & Nichol, 2021) are kept. To be specific, we use a tmax = 1000 linear variance schedule ranging from 0.0001 to 0.02, and results are generated using 250 DDPM sampling steps. All models are trained with Adam (Kingma & Ba, 2015) with no weight decay. We use a constant learning rate of 0.0001 and a batch size of 256. Horizontal flipping and random cropping are used for data augmentation. We maintain an exponential moving average of DiT weights over training with a decay of 0.9999. We use identical training hyperparameters across all experiments and train models for one million steps in total. No classifier-free guidance (Ho & Salimans, 2022) is employed in all the experiments. |
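
The hyperparameters quoted in the Experiment Setup row can be made concrete with a short sketch. The code below is not the authors' implementation (the paper reports JAX/Flax code that has not been released); it is a minimal, dependency-free Python illustration of three quoted settings: the linear learning rate warmup followed by a constant rate, the linear variance schedule, and the exponential moving average of weights. The function names are hypothetical, and the warmup is assumed to start from zero, which the quoted text does not state.

```python
# Illustrative sketch only; function names are hypothetical and the
# warmup starting value (0.0) is an assumption not stated in the source.

def learning_rate(step, peak=1e-4, warmup_steps=5000):
    """Linear warmup to `peak` over `warmup_steps`, then a constant rate."""
    if step < warmup_steps:
        return peak * step / warmup_steps
    return peak


def linear_betas(t_max=1000, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances from `beta_start` to `beta_end`."""
    delta = (beta_end - beta_start) / (t_max - 1)
    return [beta_start + i * delta for i in range(t_max)]


def ema_update(ema_params, params, decay=0.9999):
    """One exponential-moving-average step over a flat list of weights."""
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema_params, params)]


if __name__ == "__main__":
    # Learning rate at a few points of the one-million-step run.
    for step in (0, 2500, 5000, 1_000_000):
        print(step, learning_rate(step))

    betas = linear_betas()
    print(len(betas), betas[0], betas[-1])

    # Toy EMA step on a single scalar weight.
    print(ema_update([1.0], [0.0]))
```

With a decay of 0.9999, the averaged weights respond to changes over roughly the last 10,000 updates (1 / (1 - 0.9999)), which is short relative to the one million training steps reported.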