Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Bigger is not Always Better: Scaling Properties of Latent Diffusion Models
Authors: Kangfu Mei, Zhengzhong Tu, Mauricio Delbracio, Hossein Talebi, Vishal M. Patel, Peyman Milanfar
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through empirical analysis of established text-to-image diffusion models, we conduct an in-depth investigation into how model size influences sampling efficiency across varying sampling steps. Our findings unveil a surprising trend: when operating under a given inference budget, smaller models frequently outperform their larger equivalents in generating high-quality results. |
| Researcher Affiliation | Collaboration | Kangfu Mei (Johns Hopkins University), Zhengzhong Tu (Texas A&M University), Mauricio Delbracio (Google), Hossein Talebi (Google), Vishal M. Patel (Johns Hopkins University), Peyman Milanfar (Google) |
| Pseudocode | No | The paper describes methods and architectural variations in text, and refers to figures for visual representation (e.g., Fig. 2 for U-Net architecture), but does not contain any explicit pseudocode blocks or algorithms. |
| Open Source Code | No | The paper does not contain an explicit statement about releasing source code or provide a direct link to a code repository for the methodology described. |
| Open Datasets | Yes | All models were trained on TPUv5 using internal data sources with about 600 million aesthetically-filtered text-to-image pairs. ... We adopted SD v1.5 since it is among the most popular diffusion models https://huggingface.co/models?sort=likes. ... trained using the web-scale aesthetically filtered text-to-image dataset, i.e., WebLI (Chen et al., 2022). ... we test the text-to-image performance of scaled models on the validation set of COCO 2014 (Lin et al., 2014) with 30k samples. For downstream performance, specifically real-world super-resolution, we test the performance of scaled models on the validation set of DIV2K with 3k randomly cropped patches, which are degraded with the Real-ESRGAN degradation (Wang et al., 2021). |
| Dataset Splits | Yes | In order to evaluate the performance of the scaled models, we test the text-to-image performance of scaled models on the validation set of COCO 2014 (Lin et al., 2014) with 30k samples. For downstream performance, specifically real-world super-resolution, we test the performance of scaled models on the validation set of DIV2K with 3k randomly cropped patches, which are degraded with the Real-ESRGAN degradation (Wang et al., 2021). |
| Hardware Specification | Yes | All models were trained on TPUv5 using internal data sources with about 600 million aesthetically-filtered text-to-image pairs. |
| Software Dependencies | No | The paper mentions using 'Stable Diffusion v1.5 standard' and various samplers (DDIM, DDPM, DPM-Solver++) but does not specify version numbers for programming languages, libraries, or frameworks used for implementation (e.g., Python, PyTorch, CUDA versions). |
| Experiment Setup | Yes | All the models are trained for 500K steps, batch size 2048, and learning rate 1e-4. ... We used the common practice of 50 sampling steps with the DDIM sampler, 7.5 classifier-free guidance rate, for text-to-image generation. ... We demonstrate this by sampling the scaled models using different CFG rates, i.e., (1.5, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0) and comparing their quantitative and qualitative results. |
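The classifier-free guidance (CFG) rate quoted in the experiment setup (7.5 at inference, swept over 1.5-8.0 in the ablation) controls how far each denoising step extrapolates from the unconditional noise prediction toward the text-conditional one. A minimal numeric sketch of that combination rule (the function name and toy arrays are our own, not from the paper):

```python
import numpy as np

def cfg_combine(eps_uncond, eps_cond, w):
    # Classifier-free guidance: move from the unconditional prediction
    # toward the conditional one, scaled by the guidance rate w.
    # w = 1.0 recovers the plain conditional prediction.
    return eps_uncond + w * (eps_cond - eps_uncond)

# Toy noise predictions standing in for the model's outputs.
eps_u = np.array([0.1, -0.2])
eps_c = np.array([0.3, 0.0])

print(cfg_combine(eps_u, eps_c, 7.5))  # guided prediction at the paper's default rate
print(cfg_combine(eps_u, eps_c, 1.0))  # equals the conditional prediction
```

Higher rates sharpen prompt adherence at the cost of diversity, which is why the paper sweeps the rate and compares both quantitative and qualitative results at each setting.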