Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

STARFlow: Scaling Latent Normalizing Flows for High-resolution Image Synthesis

Authors: Jiatao Gu, Tianrong Chen, David Berthelot, Huangjie Zheng, Yuyang Wang, Ruixiang Zhang, Laurent Dinh, Miguel Angel Bautista, Joshua Susskind, Shuangfei Zhai

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	STARFlow achieves competitive results in both class- and text-conditional image generation, with sample quality approaching that of state-of-the-art diffusion models. To our knowledge, this is the first successful demonstration of normalizing flows at this scale and resolution. Code and weights available at https://github.com/apple/ml-starflow. 4 Experiments
Researcher Affiliation	Industry	Jiatao Gu, Tianrong Chen, David Berthelot, Huangjie Zheng, Yuyang Wang, Ruixiang Zhang, Laurent Dinh, Miguel Angel Bautista, Josh Susskind, Shuangfei Zhai Apple EMAIL
Pseudocode	Yes	Algorithm 1 Sampling from STARFlow Models
Open Source Code	Yes	Code and weights available at https://github.com/apple/ml-starflow.
Open Datasets	Yes	Dataset: We experiment with STARFlow on both class-conditioned and text-to-image generation tasks. For the former, we conduct experiments on ImageNet-1K (Deng et al., 2009) including 256 × 256 and 512 × 512 resolutions. For text-to-image, we show two settings: a constrained setting CC12M (Changpinyo et al., 2021), where each image is accompanied by a synthetic caption following (Gu et al., 2024a). We also demonstrated a scaled setting where our models trained an in-house dataset with CC12M, in total 700M text-image pairs. Evaluation: In line with prior works, we report Fréchet Inception Distance (FID) (Heusel et al., 2017) to quantify the realism and diversity of generated images. For text-to-image generation, we use MSCOCO 2017 (Lin et al., 2014) validation set to assess the zero-shot capabilities of these models.
Dataset Splits	Yes	For text-to-image generation, we use MSCOCO 2017 (Lin et al., 2014) validation set to assess the zero-shot capabilities of these models. Dataset: We experiment with STARFlow on both class-conditioned and text-to-image generation tasks. For the former, we conduct experiments on ImageNet-1K (Deng et al., 2009) including 256 × 256 and 512 × 512 resolutions.
Hardware Specification	Yes	Models are trained on 32 (for 1.4B model) or 64 (for 3.8B model) H100 GPUs for around 2 weeks.
Software Dependencies	No	The paper mentions "optimizer=AdamW" but does not specify version numbers for any key software components, libraries, or programming languages.
Experiment Setup	Yes	training config: batch_size=512, optimizer=AdamW, adam_beta1=0.9, adam_beta2=0.95, adam_eps=1e-8, learning_rate=1e-4, min_learning_rate=1e-6, learning_rate_schedule=cosine, weight_decay=1e-4, max_training_images=400M, mixed_precision_training=bf16
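
As a reading aid for the Experiment Setup row, the following minimal Python sketch shows what the listed batch size, image budget, and cosine schedule would imply under some stated assumptions. It is not taken from the paper or from the ml-starflow repository. The step count is derived here by dividing max_training_images by batch_size, the helper name cosine_lr is hypothetical, and the schedule assumes no warmup phase, which the row does not mention either way.

```python
import math

# Values copied from the "Experiment Setup" row.
BATCH_SIZE = 512
MAX_TRAINING_IMAGES = 400_000_000  # "400M"
LEARNING_RATE = 1e-4
MIN_LEARNING_RATE = 1e-6

# Derived here, not stated in the source: optimizer steps implied by the image budget.
TOTAL_STEPS = MAX_TRAINING_IMAGES // BATCH_SIZE


def cosine_lr(step, total_steps=TOTAL_STEPS,
              peak=LEARNING_RATE, floor=MIN_LEARNING_RATE):
    """Hypothetical helper: cosine decay from peak to floor.

    Assumes no warmup; the source row does not say whether one is used.
    """
    # Clamp the step so values outside [0, total_steps] stay on the curve ends.
    progress = min(max(step, 0), total_steps) / total_steps
    return floor + 0.5 * (peak - floor) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    for step in (0, TOTAL_STEPS // 2, TOTAL_STEPS):
        print(step, cosine_lr(step))
```

Under these assumptions the learning rate starts at learning_rate, reaches the midpoint between the two rates halfway through training, and ends at min_learning_rate. A real training run may differ, for example by adding warmup or by counting steps per device rather than globally.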