Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. [K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026].

Model-Informed Flows for Bayesian Inference

Authors: Joohwan Ko, Justin Domke

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Empirically, MIF delivers tighter posterior approximations and matches or exceeds state-of-the-art performance across a suite of hierarchical and non-hierarchical benchmarks. 5 Experiments Table 1: Negative ELBOs (-ELBO) for hierarchical benchmark models using mean-field Gaussian (MF), mean-field Gaussian with VIP (MF-VIP), full-rank Gaussian (FR), and full-rank Gaussian with VIP (FR-VIP). Lower values indicate tighter posterior approximations.
Researcher Affiliation	Academia	Joohwan Ko, Justin Domke Manning College of Information and Computer Sciences University of Massachusetts Amherst EMAIL
Pseudocode	Yes	Algorithm 1 Model-Informed Flow (MIF)
Open Source Code	Yes	Our implementation is available at https://github.com/joohwanko/Model-Informed-Flow
Open Datasets	Yes	To validate our theoretical results, we evaluate on six different hierarchical Bayesian models: 8Schools, German Credit, Funnel, Radon, Movielens, and IRT. These models exhibit varying degrees of funnel-like posterior geometries, making them ideal testbeds for examining the benefits of VIP-inspired flow designs. For our final comprehensive benchmark experiments, we also include more models such as Seeds, Sonar, and Ionosphere. D Experimental Details Eight Schools The Eight Schools model [39]
Dataset Splits	No	The paper lists several benchmark datasets like 8Schools, German Credit, Funnel, etc., but does not explicitly provide information on how these datasets were split into training, validation, or test sets for the experiments conducted in this paper. It implicitly relies on standard benchmarks but does not specify the splitting methodology or percentages.
Hardware Specification	Yes	All experiments were run on a single server with an Intel Xeon Platinum 8352Y CPU (128 hardware threads at 2.20 GHz), 512 GiB of RAM. For each experiment (i.e., each training or evaluation run), we used one NVIDIA A100 (40 GiB) under CUDA 12.8.
Software Dependencies	No	The paper mentions 'CUDA 12.8' for the GPU, but does not list other key software components like programming languages (e.g., Python), deep learning frameworks (e.g., PyTorch, TensorFlow), or their specific version numbers.
Experiment Setup	Yes	For all experiments, we utilize the Adam optimizer [26] and initialize all learnable parameters from a standard Gaussian distribution with a standard deviation of 0.1. ... We perform 100,000 optimization iterations, using 256 Monte Carlo samples to approximate the ELBO during training. After training, the final ELBO is evaluated using 100,000 fresh samples for reliable estimation. For each model configuration, we report the best ELBO achieved after exploring six learning rates (logarithmically spaced from 10^-1 to 10^-6).
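
The experiment setup quoted in the last row can be written down as a small configuration sketch. This is illustrative only and is not taken from the authors' repository: the names `SweepConfig`, `learning_rate_grid`, and `best_run` are hypothetical. The sketch assumes that "six learning rates logarithmically spaced" means one value per decade from 10^-1 down to 10^-6, and that "best ELBO" means the highest final ELBO (equivalently, the lowest negative ELBO).

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class SweepConfig:
    """Hypothetical container for the settings quoted in the Experiment Setup row."""
    optimizer: str = "adam"          # Adam optimizer, as stated in the quote
    init_std: float = 0.1            # std of the Gaussian parameter initialization
    num_iterations: int = 100_000    # optimization iterations per run
    train_mc_samples: int = 256      # Monte Carlo samples per ELBO estimate in training
    eval_mc_samples: int = 100_000   # fresh samples for the final ELBO estimate
    lr_max_exponent: int = -1        # largest learning rate is 10^-1
    lr_min_exponent: int = -6        # smallest learning rate is 10^-6


def learning_rate_grid(cfg: SweepConfig) -> list[float]:
    """Return one learning rate per decade, from largest to smallest (assumed spacing)."""
    exponents = range(cfg.lr_max_exponent, cfg.lr_min_exponent - 1, -1)
    return [10.0 ** e for e in exponents]


def best_run(final_elbos: dict[float, float]) -> tuple[float, float]:
    """Pick the (learning_rate, elbo) pair with the highest final ELBO."""
    return max(final_elbos.items(), key=lambda kv: kv[1])


if __name__ == "__main__":
    cfg = SweepConfig()
    grid = learning_rate_grid(cfg)
    print(len(grid), "learning rates:", grid)
```

In a sweep of this shape, each learning rate in the grid would get its own full training run, and `best_run` would then be applied to the per-rate final ELBO estimates to produce the single number reported for that model configuration.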