Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Bigger is not Always Better: Scaling Properties of Latent Diffusion Models
Authors: Kangfu Mei, Zhengzhong Tu, Mauricio Delbracio, Hossein Talebi, Vishal M. Patel, Peyman Milanfar
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through empirical analysis of established text-to-image diffusion models, we conduct an in-depth investigation into how model size influences sampling efficiency across varying sampling steps. Our findings unveil a surprising trend: when operating under a given inference budget, smaller models frequently outperform their larger equivalents in generating high-quality results. |
| Researcher Affiliation | Collaboration | Kangfu Mei (Johns Hopkins University), Zhengzhong Tu (Texas A&M University), Mauricio Delbracio (Google), Hossein Talebi (Google), Vishal M. Patel (Johns Hopkins University), Peyman Milanfar (Google) |
| Pseudocode | No | The paper describes methods and architectural variations in text, and refers to figures for visual representation (e.g., Fig. 2 for U-Net architecture), but does not contain any explicit pseudocode blocks or algorithms. |
| Open Source Code | No | The paper does not contain an explicit statement about releasing source code or provide a direct link to a code repository for the methodology described. |
| Open Datasets | Yes | All models were trained on TPUv5 using internal data sources with about 600 million aesthetically-filtered text-to-image pairs. ... We adopted SD v1.5 since it is among the most popular diffusion models https://huggingface.co/models?sort=likes. ... trained using the web-scale aesthetically filtered text-to-image dataset, i.e., WebLI (Chen et al., 2022). ... we test the text-to-image performance of scaled models on the validation set of COCO 2014 (Lin et al., 2014) with 30k samples. For downstream performance, specifically real-world super-resolution, we test the performance of scaled models on the validation set of DIV2K with 3k randomly cropped patches, which are degraded with the Real-ESRGAN degradation (Wang et al., 2021). |
| Dataset Splits | Yes | In order to evaluate the performance of the scaled models, we test the text-to-image performance of scaled models on the validation set of COCO 2014 (Lin et al., 2014) with 30k samples. For downstream performance, specifically real-world super-resolution, we test the performance of scaled models on the validation set of DIV2K with 3k randomly cropped patches, which are degraded with the Real-ESRGAN degradation (Wang et al., 2021). |
| Hardware Specification | Yes | All models were trained on TPUv5 using internal data sources with about 600 million aesthetically-filtered text-to-image pairs. |
| Software Dependencies | No | The paper mentions using 'Stable Diffusion v1.5 standard' and various samplers (DDIM, DDPM, DPM-Solver++) but does not specify version numbers for programming languages, libraries, or frameworks used for implementation (e.g., Python, PyTorch, CUDA versions). |
| Experiment Setup | Yes | All the models are trained for 500K steps, batch size 2048, and learning rate 1e-4. ... We used the common practice of 50 sampling steps with the DDIM sampler, 7.5 classifier-free guidance rate, for text-to-image generation. ... We demonstrate this by sampling the scaled models using different CFG rates, i.e., (1.5, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0) and comparing their quantitative and qualitative results. |
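The classifier-free guidance (CFG) rate quoted in the experiment setup (7.5 at inference, swept over 1.5-8.0 in the ablation) controls how far each denoising step extrapolates from the unconditional noise prediction toward the text-conditional one. A minimal numeric sketch of that combination rule (the function name and toy arrays are our own, not from the paper):

```python
import numpy as np

def cfg_combine(eps_uncond, eps_cond, w):
    # Classifier-free guidance: move from the unconditional prediction
    # toward the conditional one, scaled by the guidance rate w.
    # w = 1.0 recovers the plain conditional prediction.
    return eps_uncond + w * (eps_cond - eps_uncond)

# Toy noise predictions standing in for the model's outputs.
eps_u = np.array([0.1, -0.2])
eps_c = np.array([0.3, 0.0])

print(cfg_combine(eps_u, eps_c, 7.5))  # guided prediction at the paper's default rate
print(cfg_combine(eps_u, eps_c, 1.0))  # equals the conditional prediction
```

Higher rates sharpen prompt adherence at the cost of diversity, which is why the paper sweeps the rate and compares both quantitative and qualitative results at each setting.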