Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Flow Matching Neural Processes

Authors: Hussen Abu Hamad, Dan Rosenbaum

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We show that our model outperforms previous state-of-the-art neural process methods on various benchmarks including synthetic 1D Gaussian processes data, 2D images, and real-world weather data. Extensive experimentation on standard NP benchmarks demonstrates that Flow NP consistently outperforms prior NP models, achieving state-of-the-art results across multiple datasets. Our experiments are conducted on three distinct data domains: synthetic 1D Gaussian processes, image data from EMNIST [10] and CelebA [28], and real-world weather prediction data ERA5 [16].
Researcher Affiliation	Academia	Hussen Abu Hamad Department of Computer Science University of Haifa Dan Rosenbaum Department of Computer Science University of Haifa
Pseudocode	Yes	The training process is presented in Alg. 1 in the appendix. The sampling process is presented in Alg. 2 in the appendix.
Open Source Code	Yes	Our implementation is available at https://github.com/danrsm/flow NP. The implementation of all experiments and models is available at https://github.com/danrsm/flow NP. We provide code for the models that can be used with open sourced code for the experiments. Upon publication we will open source our full repository.
Open Datasets	Yes	Our experiments are conducted on three distinct data domains: synthetic 1D Gaussian processes, image data from EMNIST [10] and CelebA [28], and real-world weather prediction data ERA5 [16].
Dataset Splits	Yes	Evaluation is done on held-out data using a random number of context and target points sampled uniformly between 3 and 47 where the total number is constrained to be less than or equal to 50. In the second protocol [8] we use two GP kernels: RBF and Matérn-5/2, using a fixed set of parameters, and additional Gaussian observation noise with variance 0.05^2. Evaluation is done on held-out data with the same parameters, a uniformly random context size between 1 and 10, and a fixed number of 50 target points. We divide the years into 34K training samples and 17.5K evaluation samples.
Hardware Specification	Yes	All training, inference and sampling are performed with an NVIDIA RTX 4090 GPU.
Software Dependencies	No	The paper does not provide specific version numbers for key software components like libraries or frameworks (e.g., PyTorch, Python, CUDA versions) in the main text. It mentions using an ODE solver implemented by Lipman et al. [26], but without a specific version number for the software.
Experiment Setup	Yes	We implement a single Flow NP model architecture across all experiments in the main paper, adapting only the input dimensions dim(X) and output dimensions dim(Y). We use a transformer with 6 layers of full self attention, 128 hidden dimensions and 4 attention heads. We use sinusoidal encodings for the input x and flow time t with 10 frequencies per dimension, except for the ERA5 experiments where we use 40 frequencies per dimension. For likelihood evaluation we use an ODE solver based on the midpoint method with 100 steps implemented by Lipman et al. [26], and for sampling we use the Euler method with 100 steps.
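
The Experiment Setup entry above names two fixed-step ODE solvers: the midpoint method (100 steps) for likelihood evaluation and the Euler method (100 steps) for sampling. The following is a minimal sketch of how the two update rules differ, not the authors' code and not the solver of Lipman et al. [26]. The function names and the toy velocity field dy/dt = -y are assumptions made for illustration; in the paper the velocity would come from the trained transformer.

```python
import math


def euler_integrate(velocity, y0, n_steps=100, t0=0.0, t1=1.0):
    """Integrate dy/dt = velocity(t, y) from t0 to t1 with fixed-step Euler."""
    h = (t1 - t0) / n_steps
    t, y = t0, y0
    for _ in range(n_steps):
        # One velocity evaluation per step, taken at the start of the step.
        y = y + h * velocity(t, y)
        t += h
    return y


def midpoint_integrate(velocity, y0, n_steps=100, t0=0.0, t1=1.0):
    """Integrate dy/dt = velocity(t, y) with the explicit midpoint method."""
    h = (t1 - t0) / n_steps
    t, y = t0, y0
    for _ in range(n_steps):
        # Half Euler step to estimate the state at the middle of the interval,
        # then a full step using the velocity evaluated at that midpoint.
        y_mid = y + 0.5 * h * velocity(t, y)
        y = y + h * velocity(t + 0.5 * h, y_mid)
        t += h
    return y


def toy_velocity(t, y):
    """Hypothetical stand-in for a learned velocity field: dy/dt = -y."""
    return -y


# The toy ODE has the closed-form solution y(1) = y(0) * exp(-1).
exact = math.exp(-1.0)
euler_error = abs(euler_integrate(toy_velocity, 1.0) - exact)
midpoint_error = abs(midpoint_integrate(toy_velocity, 1.0) - exact)

print(f"Euler error:    {euler_error:.2e}")
print(f"Midpoint error: {midpoint_error:.2e}")
```

On this toy problem the midpoint rule is noticeably more accurate at the same step count, at the cost of two velocity evaluations per step instead of one. That trade-off is one plausible reason to prefer it where numerical accuracy matters most, such as likelihood evaluation, though the entry above does not state the authors' reasoning.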