Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026.

Discovering Latent Graphs with GFlowNets for Diverse Conditional Image Generation

Authors: Bailey Trang Nguyen, Parham Saremi, Alan Wang, Fangrui Huang, Zahra TehraniNasab, Amar Kumar, Tal Arbel, Fei-Fei Li, Ehsan Adeli

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Evaluations on natural image and medical image datasets demonstrate Rainbow's improvement in both diversity and fidelity across image synthesis, image generation, and counterfactual generation tasks. We conduct experiments to investigate the following hypotheses. H1: Utilizing diverse graphs facilitates generating diverse images; H2: Latent graphs can be extracted into meaningful and interpretable patterns; H3: Improved ability to capture diversity enhances the performance of downstream tasks. Reproducibility details are in Appendix D. 4 Experiment
Researcher Affiliation	Academia	1Dept. of Computer Science, Stanford University, Stanford, CA, USA 2Dept. of Psychiatry and Behavioral Sciences, Stanford University, Stanford, CA, USA 3Dept. of Biomedical Data Science, Stanford University, Stanford, CA, USA 4Center for Intelligent Machines, McGill University, Montreal, QC, Canada 5MILA Quebec AI institute, Montreal, QC, Canada
Pseudocode	Yes	Algorithm 1 describes the training process of Rainbow to iteratively construct diverse trajectories over the latent graph using GFlowNets and the Detailed Balance (DB) objective. The process is divided into three phases: initialization, iterative edge sampling, and loss computation.
Open Source Code	No	Yes, the paper provides open access to the data and specifies that the code will be provided upon acceptance.
Open Datasets	Yes	Natural Images We use the Flickr30k dataset [83], which includes about 30k images paired with captions describing daily-life scenes, which contain uncertainty on object choices or styles. 3D Brain MRIs We curate a dataset of about 27k datapoints for training with no diagnosed disease from the following datasets: ADNI [53], ABCD Study [33, 78], HCP [77], PPMI [55], and AIBL [14]. Chest X-rays We use the CheXpert dataset [30], which contains 170k training images.
Dataset Splits	No	The paper mentions using "Flickr30k dataset [83]", "ADNI [53], ABCD Study [33, 78], HCP [77], PPMI [55], and AIBL [14]" for 3D Brain MRIs, and "CheXpert dataset [30]" for Chest X-rays. For natural images, it also says "We quantify the diversity and quality of generations in 60 prompts from the COCO Validation set [41], with 40 images per prompt." However, it does not explicitly state the training, validation, and test splits for these datasets within the context of their own model training. The justification for Q6 in the NeurIPS checklist states "Yes, the paper specifies all the training and test details, including data splits..." but these details are not explicitly present in the accessible text.
Hardware Specification	Yes	For general-domain experiments, training was conducted on a single NVIDIA H100-80GB GPU, completing in 12 hours. The brain MRI experiment utilized 4 NVIDIA H100-80GB GPUs paired with 32 CPU cores over 3 days of pretraining, followed by continued Rainbow training on the same hardware configuration for an additional 24 hours. The chest X-ray experiments utilized 4 NVIDIA A100-80GB GPUs paired with 24 CPU cores.
Software Dependencies	No	For both metrics, we use the feature extraction model from pre-trained PyTorch Inception-v3. For feature extraction, we use a 3D ResNet50, which is particularly well-suited for capturing the complex 3D structures and patterns inherent in volumetric data. The feature vectors used for calculating the metrics are extracted from the last layer (before the classifier head) with a dimensionality of 1024. We build our Rainbow on top of the pretrained Stable Diffusion v2-1-base (SD2-1) with frozen pretrained VAE [35] image encoder/decoder, CLIP [56] text encoder, and UNet model. We use a pre-trained DenseNet-121 [28] model from the TorchXRayVision library [11]. While these identify software and models used, specific version numbers for software libraries or frameworks (e.g., PyTorch version, TorchXRayVision library version) are not provided.
Experiment Setup	Yes	Table 2 indicates hyperparameters used in this work for all experiments. Values are listed in the order Natural Images / 3D Brain MRIs / Chest X-rays. Learning Rate: 1e-5 / 25e-7 / 1e-5; Pretrained-LDM epochs: 80; Rainbow epochs: 3 / 20 / 5; Batch Size: 1 / 1 / 8; α: 1 (Freeze UNet) / 0.2 / 1 (Freeze UNet); β: 1 / 0.8 / 1; Training image shape: 3×256×256 / 1×160×192×176 / 512×512; Training condition type: Text prompt / Age and binary sexes / Text prompt; Latent image shape: 4×64×64 / 1×32×40×48×44 / 4×64×64; Latent condition shape Sc: 77×1024 / 256 / 77×1024; Inference image shape: 3×512×512 / 1×160×192×176 / 512×512; Encoder/Decoder: VAE / VAE / VAE; Graph Size N: 20 nodes / 8 nodes / 20 nodes; Number of Graphs M: 40 / 8 / 10; Sparsity ρ: 0.83 / 0.70 / 0.82; Num. Edges S: 32 / 8 / 33; Use RNN: Yes / Yes / Yes; Edge Embedding dim: 512 / 128 / 512; Latent dim hg = hc: 1024 / 1024 / 1024; Blending factor γ: 0.5 / 0.5 / 0.5.
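
The Pseudocode row above names three phases (initialization, iterative edge sampling, and loss computation) and the Detailed Balance (DB) objective. As a rough illustration only, and not the authors' Algorithm 1, the sketch below shows the standard DB residual from the GFlowNet literature on a toy edge-sampling trajectory. Everything here is an assumption made for illustration: the function names `sample_edge_trajectory` and `detailed_balance_loss` are hypothetical, and uniform forward and backward policies stand in for the learned networks the paper would use.

```python
import math
import random


def sample_edge_trajectory(n_nodes, n_edges, rng):
    """Toy stand-in for iterative edge sampling (hypothetical, not the paper's policy).

    Adds `n_edges` distinct directed edges one at a time. The forward policy is
    uniform over the edges not yet chosen; the backward policy is uniform over
    the edges already present in the child state.
    """
    # Initialization: empty graph, all non-self-loop edges are candidates.
    candidates = [(i, j) for i in range(n_nodes) for j in range(n_nodes) if i != j]
    if n_edges > len(candidates):
        raise ValueError("n_edges exceeds the number of possible directed edges")
    edges, log_pf, log_pb = [], [], []
    for step in range(n_edges):
        remaining = len(candidates)
        edges.append(candidates.pop(rng.randrange(remaining)))
        log_pf.append(-math.log(remaining))  # log P_F(s_{t+1} | s_t)
        log_pb.append(-math.log(step + 1))   # log P_B(s_t | s_{t+1})
    return edges, log_pf, log_pb


def detailed_balance_loss(log_flows, log_pf, log_pb, log_reward):
    """Mean squared DB residual over one trajectory.

    For each transition s -> s' the residual is
        log F(s) + log P_F(s'|s) - log F(s') - log P_B(s|s'),
    with log F(s') replaced by log R(s') at the terminal state.
    `log_flows` holds log F for the non-terminal states s_0 .. s_{T-1}.
    """
    if not (len(log_flows) == len(log_pf) == len(log_pb)):
        raise ValueError("inputs must have one entry per transition")
    targets = list(log_flows[1:]) + [log_reward]
    residuals = [
        lf + pf - nxt - pb
        for lf, pf, nxt, pb in zip(log_flows, log_pf, targets, log_pb)
    ]
    return sum(r * r for r in residuals) / len(residuals)


if __name__ == "__main__":
    rng = random.Random(0)
    edges, log_pf, log_pb = sample_edge_trajectory(n_nodes=4, n_edges=3, rng=rng)
    # Placeholder state flows; a real model would predict these.
    log_flows = [0.0, 0.0, 0.0]
    print(edges, detailed_balance_loss(log_flows, log_pf, log_pb, log_reward=0.0))
```

In a trained model the residual is driven toward zero for every transition, so flows that already satisfy the balance condition give a loss of zero while mismatched flows give a positive value.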