Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Multivariate Latent Recalibration for Conditional Normalizing Flows

Authors: Victor Dheur, Souhaib Ben Taieb

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Extensive experiments on both tabular and image datasets show that LR consistently improves latent calibration error and the negative log-likelihood of the recalibrated models.
Researcher Affiliation	Academia	Victor Dheur Department of Computer Science University of Mons Mons, Belgium EMAIL Souhaib Ben Taieb Department of Statistics and Data Science Mohamed bin Zayed University of Artificial Intelligence Abu Dhabi, United Arab Emirates EMAIL
Pseudocode	Yes	Algorithm 1 Pre-rank recalibration.
Open Source Code	Yes	A public codebase is provided to ensure reproducibility. [Footnote 1: https://github.com/Vekteur/latent-recalibration]
Open Datasets	Yes	We present an extensive experimental study using 29 tabular datasets widely used in prior research (Tsoumakas et al., 2011; Cevid et al., 2022; Chung et al., 2024; Feldman et al., 2023; Wang et al., 2023; Barrio et al., 2024; Camehl et al., 2024). Furthermore, while recent work on model recalibration (Chung et al., 2024; Fang et al., 2025) has primarily focused on data modalities with relatively low output dimensionality, we also include a high-dimensional output setting with an image dataset with a larger output dimension (Choi et al., 2020).
Dataset Splits	Yes	Following the protocol of Chung et al. (2024), we use a 65/20/15 split for training, validation, and testing.
Hardware Specification	Yes	Computing the main tabular data results requires approximately 24 hours on an RTX A6000 GPU, and reproducing the image results requires approximately 6 hours on an RTX 6000 GPU.
Software Dependencies	No	The paper mentions "Adam (Kingma and Ba, 2014)" as the optimizer but does not specify version numbers for any software libraries or frameworks used in the implementation.
Experiment Setup	Yes	For the convex potential flow, the number of units in the input convex neural network is chosen from [10, 20, 40], the number of layers from [2, 3, 5], and the learning rate from [5 × 10⁻³, 10⁻³, 2 × 10⁻⁴]. All models are trained by minimizing the NLL with the Adam optimizer (Kingma and Ba, 2014) using a batch size of 1024.
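
The Dataset Splits and Experiment Setup rows can be illustrated with a minimal sketch, which is not taken from the authors' codebase. It assumes a shuffled 65/20/15 index split and an exhaustive enumeration of the convex potential flow grid; the paper only says values are "chosen from" these lists, so the exhaustive search is an assumption. The learning rates are read as 5e-3, 1e-3, and 2e-4, on the assumption that the exponents are negative. The names `split_indices`, `grid_configs`, and `GRID` are hypothetical.

```python
import itertools
import random


def split_indices(n, fractions=(0.65, 0.20, 0.15), seed=0):
    """Shuffle range(n) and cut it into train/validation/test index lists."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)  # seeded for repeatability
    n_train = round(fractions[0] * n)
    n_val = round(fractions[1] * n)
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]  # the remainder goes to the test set
    return train, val, test


# Hypothetical grid mirroring the values quoted in the Experiment Setup row.
GRID = {
    "units": [10, 20, 40],
    "layers": [2, 3, 5],
    "learning_rate": [5e-3, 1e-3, 2e-4],  # assumed negative exponents
}


def grid_configs(grid):
    """Return every combination of grid values as a list of dicts."""
    keys = list(grid)
    value_lists = (grid[key] for key in keys)
    return [dict(zip(keys, values)) for values in itertools.product(*value_lists)]


train, val, test = split_indices(1000)
configs = grid_configs(GRID)
print(len(train), len(val), len(test), len(configs))
```

With 1000 examples this gives 650 training, 200 validation, and 150 test indices, and 27 candidate configurations. The batch size of 1024 and the Adam optimizer would be fixed across all configurations, as stated in the table.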