Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

STARFlow: Scaling Latent Normalizing Flows for High-resolution Image Synthesis

Authors: Jiatao Gu, Tianrong Chen, David Berthelot, Huangjie Zheng, Yuyang Wang, Ruixiang Zhang, Laurent Dinh, Miguel Angel Bautista, Joshua Susskind, Shuangfei Zhai

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental STARFlow achieves competitive results in both class- and text-conditional image generation, with sample quality approaching that of state-of-the-art diffusion models. To our knowledge, this is the first successful demonstration of normalizing flows at this scale and resolution. Code and weights available at https://github.com/apple/ml-starflow. 4 Experiments
Researcher Affiliation Industry Jiatao Gu, Tianrong Chen, David Berthelot, Huangjie Zheng, Yuyang Wang, Ruixiang Zhang, Laurent Dinh, Miguel Angel Bautista, Josh Susskind, Shuangfei Zhai Apple EMAIL
Pseudocode Yes Algorithm 1 Sampling from STARFlow Models
Open Source Code Yes Code and weights available at https://github.com/apple/ml-starflow.
Open Datasets Yes Dataset We experiment with STARFlow on both class-conditioned and text-to-image generation tasks. For the former, we conduct experiments on ImageNet-1K (Deng et al., 2009) including 256 × 256 and 512 × 512 resolutions. For text-to-image, we show two settings: a constrained setting CC12M (Changpinyo et al., 2021), where each image is accompanied by a synthetic caption following (Gu et al., 2024a). We also demonstrated a scaled setting where our models trained an in-house dataset with CC12M, in total 700M text-image pairs. Evaluation In line with prior works, we report Fréchet Inception Distance (FID) (Heusel et al., 2017) to quantify the the realism and diversity of generated images. For text-to-image generation, we use MSCOCO 2017 (Lin et al., 2014) validation set to assess the zero-shot capabilities of these models.
Dataset Splits Yes For text-to-image generation, we use MSCOCO 2017 (Lin et al., 2014) validation set to assess the zero-shot capabilities of these models. Dataset We experiment with STARFlow on both class-conditioned and text-to-image generation tasks. For the former, we conduct experiments on ImageNet-1K (Deng et al., 2009) including 256 × 256 and 512 × 512 resolutions.
Hardware Specification Yes Models are trained on 32 (for 1.4B model) or 64 (for 3.8B model) H100 GPUs for around 2 weeks.
Software Dependencies No The paper mentions "optimizer=AdamW" but does not specify version numbers for any key software components, libraries, or programming languages.
Experiment Setup Yes training config: batch_size=512 optimizer=AdamW adam_beta1=0.9 adam_beta2=0.95 adam_eps=1e-8 learning_rate=1e-4 min_learning_rate=1e-6 learning_rate_schedule=cosine weight_decay=1e-4 max_training_images=400M mixed_precision_training=bf16
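
To make the quoted training configuration easier to read, the following is a minimal Python sketch of what a cosine schedule with these values could look like. It is not taken from the paper or the ml-starflow repository. The function name `cosine_lr` is hypothetical, and two things are assumed because the quoted config does not state them: that there is no warmup phase, and that each optimizer step consumes exactly `batch_size` images, so the step count is `max_training_images / batch_size`.

```python
import math

# Values copied from the quoted training config.
BATCH_SIZE = 512
MAX_TRAINING_IMAGES = 400_000_000
LEARNING_RATE = 1e-4
MIN_LEARNING_RATE = 1e-6

# Assumption: one optimizer step consumes exactly BATCH_SIZE images.
TOTAL_STEPS = MAX_TRAINING_IMAGES // BATCH_SIZE  # 781,250 steps


def cosine_lr(step, total_steps=TOTAL_STEPS,
              peak=LEARNING_RATE, floor=MIN_LEARNING_RATE):
    """Cosine decay from `peak` to `floor` (hypothetical helper, no warmup assumed)."""
    # Clamp the step so out-of-range values do not leave the [floor, peak] interval.
    progress = min(max(step, 0), total_steps) / total_steps
    return floor + 0.5 * (peak - floor) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    for step in (0, TOTAL_STEPS // 2, TOTAL_STEPS):
        print(f"step {step:>7}: lr = {cosine_lr(step):.3e}")
```

Under these assumptions the schedule starts at the configured learning rate, passes through the midpoint of the two rates halfway through training, and ends at the configured minimum. If the actual training run uses warmup or gradient accumulation, the step count and the shape of the curve would differ.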