Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Estimating Model Performance Under Covariate Shift Without Labels

Authors: Jakub Białek, Juhani Kivimäki, Wojciech Kuberski, Nikolaos Perrakis

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We tested PAPE using over 900 dataset-model combinations from the US census data, assessing its performance against several benchmarks through various metrics. Our findings show that PAPE outperforms other methodologies, making it a superior choice for estimating the performance of binary classification models.
Researcher Affiliation | Collaboration | Jakub Białek, NannyML NV, Belgium, EMAIL; Juhani Kivimäki, University of Helsinki, Finland, EMAIL; Wojtek Kuberski, NannyML NV, Belgium, EMAIL; Nikolaos Perrakis, NannyML NV, Belgium, EMAIL
Pseudocode | No | The paper describes the methodology using prose and mathematical equations but does not present any clearly labeled pseudocode or algorithm blocks with structured steps.
Open Source Code | Yes | We make our code available as a public github repository (https://github.com/pape-research/pape_r).
Open Datasets | Yes | The datasets we used to evaluate the method come from Folktables [36]. Folktables uses US census data and preprocesses it to create a set of binary classification problems. We ran additional experiments with data from the recently published TableShift benchmark [35].
Dataset Splits | Yes | We started by separating the first-year data (2015) in each fetched dataset and used it as a training period. The rest of the data for each case was further divided into two periods: reference (the year 2016) and production (2017, 2018). Production data was further split into data chunks of 2,000 observations each, maintaining the order of the observations.
Hardware Specification | Yes | We used a single 11th Gen Intel i7-11800H 2.30GHz machine; computation took over 120 hours.
Software Dependencies | No | The paper mentions using "LGBM Classifier" and "LGBM Regressor" as well as other algorithms (Logistic Regression, Neural Network Model, Random Forest, XGBoost), but it does not specify concrete version numbers for these software components.
Experiment Setup | Yes | For each resulting training data set, we fitted five commonly used binary classification algorithms: Logistic Regression, Neural Network Model [37], Random Forest [38], XGBoost [39], and LGBM [40] with default parameters. We use an LGBM [40] Classifier as the DRE model, and an LGBM Regressor for the calibration mapping, both with default hyperparameters.
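
The Dataset Splits entry above describes a temporal split (2015 for training, 2016 for reference, 2017 and 2018 for production) followed by order-preserving chunks of 2,000 production observations. The following is a minimal sketch of that procedure, not the authors' code. The function names, the `year` field, and the `drop_last` option are hypothetical; in particular, the source does not say how a trailing chunk with fewer than 2,000 observations is handled, so dropping it by default is an assumption.

```python
CHUNK_SIZE = 2_000  # observations per production chunk, as stated in the entry above


def split_by_period(rows, year_key="year"):
    """Assign each row to the training, reference, or production period.

    `rows` is any iterable of dicts; `year_key` is a hypothetical field name.
    Rows from other years are ignored. Input order is preserved.
    """
    periods = {"training": [], "reference": [], "production": []}
    for row in rows:
        year = row[year_key]
        if year == 2015:
            periods["training"].append(row)
        elif year == 2016:
            periods["reference"].append(row)
        elif year in (2017, 2018):
            periods["production"].append(row)
    return periods


def chunk_in_order(rows, size=CHUNK_SIZE, drop_last=True):
    """Cut rows into consecutive chunks of `size` without reordering them.

    Assumption: a trailing chunk shorter than `size` is dropped unless
    drop_last=False; the source does not specify this.
    """
    chunks = [rows[i:i + size] for i in range(0, len(rows), size)]
    if drop_last and chunks and len(chunks[-1]) < size:
        chunks.pop()
    return chunks


if __name__ == "__main__":
    # Synthetic rows: 2,500 observations for each year from 2015 to 2018.
    rows = [{"year": 2015 + (i * 4) // 10_000, "id": i} for i in range(10_000)]
    periods = split_by_period(rows)
    chunks = chunk_in_order(periods["production"])
    print({name: len(part) for name, part in periods.items()}, len(chunks))
```

Keeping the chunks in observation order matters here because each chunk stands in for a consecutive slice of production time; shuffling before chunking would mix observations from different periods.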