Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Multivariate Latent Recalibration for Conditional Normalizing Flows
Authors: Victor Dheur, Souhaib Ben Taieb
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on both tabular and image datasets show that LR consistently improves latent calibration error and the negative log-likelihood of the recalibrated models. |
| Researcher Affiliation | Academia | Victor Dheur, Department of Computer Science, University of Mons, Mons, Belgium. Souhaib Ben Taieb, Department of Statistics and Data Science, Mohamed bin Zayed University of Artificial Intelligence, Abu Dhabi, United Arab Emirates. |
| Pseudocode | Yes | Algorithm 1 Pre-rank recalibration. |
| Open Source Code | Yes | A public codebase is provided to ensure reproducibility (footnote 1: https://github.com/Vekteur/latent-recalibration). |
| Open Datasets | Yes | We present an extensive experimental study using 29 tabular datasets widely used in prior research (Tsoumakas et al., 2011; Cevid et al., 2022; Chung et al., 2024; Feldman et al., 2023; Wang et al., 2023; Barrio et al., 2024; Camehl et al., 2024). Furthermore, while recent work on model recalibration (Chung et al., 2024; Fang et al., 2025) has primarily focused on data modalities with relatively low output dimensionality, we also include a high-dimensional output setting with an image dataset with a larger output dimension (Choi et al., 2020). |
| Dataset Splits | Yes | Following the protocol of Chung et al. (2024), we use a 65/20/15 split for training, validation, and testing. |
| Hardware Specification | Yes | Computing the main tabular data results requires approximately 24 hours on an RTX A6000 GPU, and reproducing the image results requires approximately 6 hours on an RTX 6000 GPU. |
| Software Dependencies | No | The paper mentions "Adam (Kingma and Ba, 2014)" as the optimizer but does not specify version numbers for any software libraries or frameworks used in the implementation. |
| Experiment Setup | Yes | For the convex potential flow, the number of units in the input convex neural network is chosen from [10, 20, 40], the number of layers from [2, 3, 5], and the learning rate from [$5 \times 10^{-3}$, $10^{-3}$, $2 \times 10^{-4}$]. All models are trained by minimizing the NLL with the Adam optimizer (Kingma and Ba, 2014) using a batch size of 1024. |
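
For readers who want to mirror the setup quoted in the Dataset Splits and Experiment Setup rows, the following is a minimal sketch, not the authors' code. The function names `split_indices` and `grid_configs`, the dictionary keys, and the seed are hypothetical. The learning rates are read as 5e-3, 1e-3, and 2e-4, which is an assumption about the exponents lost in extraction. The rows do not say whether the search space is explored exhaustively, so the full Cartesian product shown here is also an assumption.

```python
import itertools
import random


def split_indices(n, fractions=(0.65, 0.20, 0.15), seed=0):
    """Shuffle range(n) and cut it into train/validation/test index lists.

    The 65/20/15 proportions come from the Dataset Splits row; the shuffling
    scheme and the seed are illustrative assumptions.
    """
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_train = int(round(fractions[0] * n))
    n_val = int(round(fractions[1] * n))
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]  # remainder, so no sample is dropped
    return train, val, test


# Search space quoted in the Experiment Setup row (convex potential flow).
# Key names are hypothetical; learning-rate exponents are assumed negative.
GRID = {
    "icnn_units": [10, 20, 40],
    "icnn_layers": [2, 3, 5],
    "learning_rate": [5e-3, 1e-3, 2e-4],
}


def grid_configs(grid=GRID):
    """Enumerate every combination of the search space as a list of dicts."""
    keys = list(grid)
    combos = itertools.product(*(grid[k] for k in keys))
    return [dict(zip(keys, values)) for values in combos]


if __name__ == "__main__":
    train, val, test = split_indices(1000)
    print(len(train), len(val), len(test))
    print(len(grid_configs()), "candidate configurations")
```

The test set takes whatever remains after the training and validation cuts, so rounding never loses a sample. The batch size of 1024 and the Adam optimizer mentioned in the table would be set in the training loop, which is outside the scope of this sketch.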