Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Gradient Variance Reveals Failure Modes in Flow-Based Generative Models

Authors: Teodora Reu, Sixtine Dromigny, Michael Bronstein, Francisco Vargas

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Our theoretical and empirical study uncovers why standard neural architectures struggle to represent even simple transports and how repeated rectifications can lead to memorization rather than improvement. We validate our findings empirically on the CelebA dataset, confirming that deterministic interpolants induce memorization, while the injection of small noise restores generalization.
Researcher Affiliation	Collaboration	Teodora Reu (University of Oxford), Sixtine Dromigny (University of Oxford), Michael Bronstein (University of Oxford), Francisco Vargas (Xaira Therapeutics)
Pseudocode	No	The paper describes methodologies and theoretical proofs in narrative text and mathematical formulations. It does not contain any clearly labeled pseudocode or algorithm blocks with structured steps.
Open Source Code	Yes	We will provide our code in the supplementary materials.
Open Datasets	Yes	We validate our findings empirically on the CelebA dataset, confirming that deterministic interpolants induce memorization, while the injection of small noise restores generalization. To empirically validate Proposition 2, we conduct experiments on the CelebA dataset using Conditional Flow Matching (CFM), with ground-truth optimal transport (OT) pairings from Korotin et al. [2021] as reference. trained four U-Net models to learn the transport map from π0 = N(0, Id) to π1 CIFAR-10, using both noiseless and noisy interpolants (see Appendix G for experimental details).
Dataset Splits	Yes	The model variants are compared on two criteria: generalization (L2 error to true OT targets) and memorization (L2 error to the shuffled (training) targets). We measure both on held-out subsets (5K and 50K samples). Memorization effects diminish with larger datasets, as fixed model capacity makes perfect memorization infeasible for 50K samples compared to 5K. Models are trained and evaluated on both in-sample (training) and out-of-sample (test) data. We computed FID on both the training and validation sets.
Hardware Specification	Yes	All CIFAR-10 experiments were conducted on a compute cluster equipped with NVIDIA A10 GPUs (24 GB VRAM, CUDA 12.2). Each training run was allocated a single A10 GPU and typically ran for 24 hours to reach 240,000 optimization steps.
Software Dependencies	No	All CIFAR-10 experiments were conducted on a compute cluster equipped with NVIDIA A10 GPUs (24 GB VRAM, CUDA 12.2).
Experiment Setup	Yes	Model architecture. All experiments used a U-Net-based neural network (UNet Model Wrapper) with the following configuration: input shape (3, 32, 32), base channels 128, 2 residual blocks per level, channel multipliers [1, 2, 2, 2], attention at 16x16 resolution (4 heads, 64 head channels), and dropout rate 0.1. The model is wrapped in a Neural ODE solver (Euler method). Training. Models were trained on the CIFAR-10 training set, using random horizontal flips and normalization to [-1, 1]. Optimization used Adam with learning rate 2e-4, batch size 128, gradient clipping at 1.0, and a linear warmup over the first 5,000 steps. Each run used 400,001 steps (unless otherwise noted), with exponential moving average (EMA) of model weights (0.9999 decay). Checkpoints were saved every 20,000 steps. All experiments used 4 data loader workers and CUDA if available.
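
The training hyperparameters quoted in the Experiment Setup row can be collected into a small configuration sketch. The following is an illustration only, not the authors' code: the names TrainConfig, warmup_lr, ema_update and checkpoint_steps are hypothetical, the learning rate is assumed to stay constant after the warmup, and checkpoints are assumed to be written at multiples of 20,000 steps starting from step 20,000. It uses only the Python standard library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    # Values quoted in the Experiment Setup row above.
    learning_rate: float = 2e-4
    batch_size: int = 128
    grad_clip: float = 1.0
    warmup_steps: int = 5_000
    total_steps: int = 400_001
    ema_decay: float = 0.9999
    checkpoint_every: int = 20_000


def warmup_lr(step, cfg):
    """Linear warmup over the first cfg.warmup_steps steps.

    Assumption: the rate is held constant afterwards; the source
    only states that a linear warmup is used.
    """
    scale = min(step, cfg.warmup_steps) / cfg.warmup_steps
    return cfg.learning_rate * scale


def ema_update(ema, weights, decay):
    """One exponential-moving-average step over flat lists of weights."""
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema, weights)]


def checkpoint_steps(cfg):
    """Steps at which a checkpoint would be saved (assumed indexing)."""
    return list(range(cfg.checkpoint_every, cfg.total_steps, cfg.checkpoint_every))


if __name__ == "__main__":
    cfg = TrainConfig()
    print(warmup_lr(2_500, cfg))       # halfway through the warmup
    print(len(checkpoint_steps(cfg)))  # number of checkpoints under these assumptions
```

In an actual run these values would be passed to the optimizer, gradient-clipping and EMA utilities of whichever framework is used; the sketch only makes the quoted numbers and their interaction explicit.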