Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Bigger is not Always Better: Scaling Properties of Latent Diffusion Models

Authors: Kangfu Mei, Zhengzhong Tu, Mauricio Delbracio, Hossein Talebi, Vishal M. Patel, Peyman Milanfar

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Through empirical analysis of established text-to-image diffusion models, we conduct an in-depth investigation into how model size influences sampling efficiency across varying sampling steps. Our findings unveil a surprising trend: when operating under a given inference budget, smaller models frequently outperform their larger equivalents in generating high-quality results.
Researcher Affiliation | Collaboration | Kangfu Mei (EMAIL), Johns Hopkins University; Zhengzhong Tu (EMAIL), Texas A&M University; Mauricio Delbracio (EMAIL), Google; Hossein Talebi (EMAIL), Google; Vishal M. Patel (EMAIL), Johns Hopkins University; Peyman Milanfar (EMAIL), Google
Pseudocode | No | The paper describes methods and architectural variations in text and refers to figures for visual representation (e.g., Fig. 2 for the U-Net architecture), but does not contain any explicit pseudocode blocks or algorithms.
Open Source Code | No | The paper does not contain an explicit statement about releasing source code or provide a direct link to a code repository for the methodology described.
Open Datasets | Yes | All models were trained on TPUv5 using internal data sources with about 600 million aesthetically-filtered text-to-image pairs. ... We adopted SD v1.5 since it is among the most popular diffusion models https://huggingface.co/models?sort=likes. ... trained using the web-scale aesthetically filtered text-to-image dataset, i.e., WebLI (Chen et al., 2022). ... we test the text-to-image performance of scaled models on the validation set of COCO 2014 (Lin et al., 2014) with 30k samples. For downstream performance, specifically real-world super-resolution, we test the performance of scaled models on the validation set of DIV2K with 3k randomly cropped patches, which are degraded with the Real-ESRGAN degradation (Wang et al., 2021).
Dataset Splits | Yes | In order to evaluate the performance of the scaled models, we test the text-to-image performance of scaled models on the validation set of COCO 2014 (Lin et al., 2014) with 30k samples. For downstream performance, specifically real-world super-resolution, we test the performance of scaled models on the validation set of DIV2K with 3k randomly cropped patches, which are degraded with the Real-ESRGAN degradation (Wang et al., 2021).
Hardware Specification | Yes | All models were trained on TPUv5 using internal data sources with about 600 million aesthetically-filtered text-to-image pairs.
Software Dependencies | No | The paper mentions using 'Stable Diffusion v1.5 standard' and various samplers (DDIM, DDPM, DPM-Solver++) but does not specify version numbers for programming languages, libraries, or frameworks used for implementation (e.g., Python, PyTorch, CUDA versions).
Experiment Setup | Yes | All the models are trained for 500K steps, batch size 2048, and learning rate 1e-4. ... We used the common practice of 50 sampling steps with the DDIM sampler and a 7.5 classifier-free guidance rate for text-to-image generation. ... We demonstrate this by sampling the scaled models using different CFG rates, i.e., (1.5, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0), and comparing their quantitative and qualitative results.
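The classifier-free guidance (CFG) rate quoted in the setup above scales how strongly each denoising step favors the text-conditional noise prediction over the unconditional one. The sketch below shows only the standard CFG combination rule, not the authors' implementation; the function name `cfg_combine` and the toy latent shapes are illustrative assumptions.

```python
import numpy as np

def cfg_combine(eps_uncond, eps_cond, guidance_rate):
    """Standard classifier-free guidance: extrapolate from the
    unconditional prediction toward the conditional one.
    guidance_rate = 1.0 recovers the purely conditional prediction."""
    return eps_uncond + guidance_rate * (eps_cond - eps_uncond)

# Toy noise predictions with an SD v1.5-like latent shape (illustrative).
eps_u = np.zeros((1, 4, 64, 64))
eps_c = np.ones((1, 4, 64, 64))

# Sweep the CFG rates reported in the paper's experiment setup.
for w in (1.5, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0):
    guided = cfg_combine(eps_u, eps_c, w)
```

At each of the 50 DDIM steps, the sampler would run the denoiser twice (with and without the text prompt) and feed this combined prediction into the DDIM update; higher rates trade diversity for prompt adherence, which is why the paper sweeps them.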