Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
STARFlow: Scaling Latent Normalizing Flows for High-resolution Image Synthesis
Authors: Jiatao Gu, Tianrong Chen, David Berthelot, Huangjie Zheng, Yuyang Wang, Ruixiang Zhang, Laurent Dinh, Miguel Angel Bautista, Joshua Susskind, Shuangfei Zhai
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | STARFlow achieves competitive results in both class- and text-conditional image generation, with sample quality approaching that of state-of-the-art diffusion models. To our knowledge, this is the first successful demonstration of normalizing flows at this scale and resolution. Code and weights available at https://github.com/apple/ml-starflow. 4 Experiments |
| Researcher Affiliation | Industry | Jiatao Gu, Tianrong Chen, David Berthelot, Huangjie Zheng, Yuyang Wang, Ruixiang Zhang, Laurent Dinh, Miguel Angel Bautista, Josh Susskind, Shuangfei Zhai Apple EMAIL |
| Pseudocode | Yes | Algorithm 1 Sampling from STARFlow Models |
| Open Source Code | Yes | Code and weights available at https://github.com/apple/ml-starflow. |
| Open Datasets | Yes | Dataset: We experiment with STARFlow on both class-conditioned and text-to-image generation tasks. For the former, we conduct experiments on ImageNet-1K (Deng et al., 2009) including 256 × 256 and 512 × 512 resolutions. For text-to-image, we show two settings: a constrained setting CC12M (Changpinyo et al., 2021), where each image is accompanied by a synthetic caption following (Gu et al., 2024a). We also demonstrated a scaled setting where our models trained on an in-house dataset with CC12M, in total 700M text-image pairs. Evaluation: In line with prior works, we report Fréchet Inception Distance (FID) (Heusel et al., 2017) to quantify the realism and diversity of generated images. For text-to-image generation, we use MSCOCO 2017 (Lin et al., 2014) validation set to assess the zero-shot capabilities of these models. |
| Dataset Splits | Yes | For text-to-image generation, we use MSCOCO 2017 (Lin et al., 2014) validation set to assess the zero-shot capabilities of these models. Dataset: We experiment with STARFlow on both class-conditioned and text-to-image generation tasks. For the former, we conduct experiments on ImageNet-1K (Deng et al., 2009) including 256 × 256 and 512 × 512 resolutions. |
| Hardware Specification | Yes | Models are trained on 32 (for 1.4B model) or 64 (for 3.8B model) H100 GPUs for around 2 weeks. |
| Software Dependencies | No | The paper mentions "optimizer=AdamW" but does not specify version numbers for any key software components, libraries, or programming languages. |
| Experiment Setup | Yes | Training config: batch_size=512, optimizer=AdamW, adam_beta1=0.9, adam_beta2=0.95, adam_eps=1e-8, learning_rate=1e-4, min_learning_rate=1e-6, learning_rate_schedule=cosine, weight_decay=1e-4, max_training_images=400M, mixed_precision_training=bf16 |
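
The training configuration in the Experiment Setup row can be read as a cosine learning-rate decay from 1e-4 to 1e-6 over the full run. The sketch below is only an illustration of that reading, not the authors' code: the function name `cosine_lr` is hypothetical, the schedule assumes no warmup (the quoted config does not mention one), and the step count assumes each optimizer step consumes exactly one batch of 512 images.

```python
import math

# Values quoted in the "Experiment Setup" row of the table.
BATCH_SIZE = 512
MAX_TRAINING_IMAGES = 400_000_000
PEAK_LR = 1e-4
MIN_LR = 1e-6

# Derived under the assumption that one optimizer step consumes one full batch.
TOTAL_STEPS = MAX_TRAINING_IMAGES // BATCH_SIZE


def cosine_lr(step: int,
              total_steps: int = TOTAL_STEPS,
              peak_lr: float = PEAK_LR,
              min_lr: float = MIN_LR) -> float:
    """Cosine decay from peak_lr to min_lr (hypothetical helper, no warmup assumed)."""
    # Clamp the step so values outside the run map to the schedule's endpoints.
    progress = min(max(step, 0), total_steps) / total_steps
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    # Print the learning rate at the start, midpoint, and end of training.
    for step in (0, TOTAL_STEPS // 2, TOTAL_STEPS):
        print(step, cosine_lr(step))
```

Under these assumptions the run is 781,250 steps long, and the learning rate passes through roughly 5.05e-5 at the halfway point. The remaining quoted values (betas of 0.9 and 0.95, eps of 1e-8, weight decay of 1e-4) would plausibly be passed to an AdamW optimizer, but the framework and its version are not stated in the paper.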