Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Inductive Domain Transfer In Misspecified Simulation-Based Inference

Authors: Ortal Senouf, Antoine Wehenkel, Cédric Vincent-Cuaz, Emmanuel Abbé, Pascal Frossard

NeurIPS 2025

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | We evaluated our proposed approach on four benchmarks: a synthetic one, two real but controlled ones, and one complex real-world benchmark. |
| Researcher Affiliation | Collaboration | Ortal Senouf, EPFL, Lausanne, Switzerland; Antoine Wehenkel, Apple, Zürich, Switzerland; Cédric Vincent-Cuaz, EPFL, Lausanne, Switzerland; Emmanuel Abbé, EPFL and Apple, Lausanne, Switzerland; Pascal Frossard, EPFL, Lausanne, Switzerland |
| Pseudocode | Yes | A complete description of the full pipeline and training procedure is provided in Algorithm 3.2. |
| Open Source Code | Yes | Code is available for all experiments. |
| Open Datasets | Yes | Causal Chambers [34]. Two real, controlled datasets collected from experimental rigs (a wind tunnel and a light tunnel) with adjustable parameters. This benchmark uses a subset [6, 35] of the MIMIC-II dataset [36], comprising 350 patients who underwent thermodilution, a procedure estimating cardiac output (CO) via cold fluid injection and downstream temperature measurement. For simulation, we use OpenBF [37], a validated 1D cardiovascular flow simulator supporting fast, multiscale finite-volume simulations. |
| Dataset Splits | Yes | We evaluate methods that rely on calibration data using 5-fold cross-validation. In each fold, a different, randomly sampled subset of the calibration data is used for training and validation, with independent random initialization of model weights. This approach captures both data variability and the effects of random initialization, providing a robust assessment of performance. We evaluate calibration set sizes of 10, 50, 200, and 1000 samples, while keeping the test set size fixed at 1000 samples across all benchmarks. |
| Hardware Specification | No | Not in the body of the paper, but in the supplementary material. |
| Software Dependencies | Yes | For all experiments involving OT, we use solvers from the POT Python library [39, 40]. |
| Experiment Setup | Yes | Following the guidelines in [22], the entropy regularization weight γ is set to 0.5 for all baselines involving OT, including our joint training approach. Finally, since this section focuses on the setting where both simulations and real observations are drawn from the same prior distribution p(θ), we use balanced Sinkhorn OT for RoPE, effectively setting ρ in eq. 1 to a very high value. |
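
The Experiment Setup entry mentions balanced Sinkhorn OT with an entropy regularization weight γ = 0.5. The sketch below illustrates what that computation looks like. It is a dependency-free illustration, not the POT solver the paper reports using. The squared Euclidean cost, the uniform sample weights, the toy point clouds, and the names `sinkhorn_balanced`, `sim`, and `real` are all assumptions made for this example.

```python
import math
import random


def sinkhorn_balanced(a, b, cost, reg=0.5, n_iters=500, tol=1e-9):
    """Entropy-regularized balanced OT via Sinkhorn scaling iterations.

    a, b : source and target weights (each list sums to 1)
    cost : cost matrix as a list of rows, cost[i][j]
    reg  : entropy regularization weight (0.5, as quoted in the table)
    """
    n, m = len(a), len(b)
    # Gibbs kernel K = exp(-cost / reg)
    K = [[math.exp(-cost[i][j] / reg) for j in range(m)] for i in range(n)]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(n_iters):
        # Rescale rows to match a, then columns to match b
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
        # Stop once the row marginals agree with a to within tol
        err = max(
            abs(u[i] * sum(K[i][j] * v[j] for j in range(m)) - a[i])
            for i in range(n)
        )
        if err < tol:
            break
    # Transport plan P = diag(u) K diag(v)
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]


# Toy data (assumption): 6 "simulated" and 4 "real" points in the unit square
rng = random.Random(0)
sim = [(rng.random(), rng.random()) for _ in range(6)]
real = [(rng.random(), rng.random()) for _ in range(4)]

# Squared Euclidean cost between every simulated/real pair (assumption)
cost = [[(sx - rx) ** 2 + (sy - ry) ** 2 for rx, ry in real] for sx, sy in sim]

# Uniform weights on both sides: the balanced setting enforces both marginals
a = [1.0 / len(sim)] * len(sim)
b = [1.0 / len(real)] * len(real)

plan = sinkhorn_balanced(a, b, cost, reg=0.5)
row_sums = [sum(row) for row in plan]
col_sums = [sum(plan[i][j] for i in range(len(sim))) for j in range(len(real))]
```

In the balanced case both marginals of the plan are constrained to equal the prescribed weights, which is the limit the table describes as setting ρ to a very high value. With POT installed, `ot.sinkhorn(a, b, M, reg)` would likely play the role of `sinkhorn_balanced` here, although the exact calls used in the paper are not stated in this summary.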