Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
A hierarchical decomposition for explaining ML performance discrepancies
Authors: Harvineet Singh, Fan Xia, Adarsh Subbaswamy, Alexej Gossmann, Jean Feng
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate the utility of our framework in real-world examples of prediction models for hospital readmission and insurance coverage. Code for reproducing experiments is available at https://github.com/jjfeng/HDPD. |
| Researcher Affiliation | Collaboration | Harvineet Singh¹, Fan Xia¹, Adarsh Subbaswamy², Alexej Gossmann², Jean Feng¹. ¹University of California, San Francisco; ²U.S. Food and Drug Administration, Center for Devices and Radiological Health |
| Pseudocode | Yes | Algorithm 1 Aggregate decompositions into baseline, conditional covariate, and conditional outcome shifts; Algorithm 2 VALUECONDITIONALOUTCOME(S): Value for s-partial conditional outcome shift for a subset s; Algorithm 3 VALUECONDITIONALCOVARIATE(S): Value for s-partial conditional covariate shift for a subset s; Algorithm 4 Detailed decomposition for conditional outcome and covariate shift |
| Open Source Code | Yes | Code for reproducing experiments is available at https://github.com/jjfeng/HDPD. |
| Open Datasets | Yes | We analyze a neural network trained to predict whether a person has public health insurance using data from Nebraska in the American Community Survey (source, n = 3000), applied to data from Louisiana (target, n = 6000). |
| Dataset Splits | Yes | Let the data be randomly split into training and evaluation partitions. ... We fit all models on 80% of the data points from both source and target datasets which is the Tr partition, and keep the remaining 20% for computing the estimators which is the Ev partition. |
| Hardware Specification | Yes | All experiments are run on a 2.60 GHz processor with 8 CPU cores. |
| Software Dependencies | No | The paper mentions using 'scikit-learn implementations' but does not specify version numbers for any software dependencies like scikit-learn, Python, or other libraries. |
| Experiment Setup | Yes | We use 3-fold cross validation to select models. ... We clip the predicted probabilities from the density model for π at 10⁻⁶ to avoid very large density weights. ... Specific hyperparameter ranges for the grid search are provided in the code. |
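
The Dataset Splits and Experiment Setup rows describe an 80/20 random split into Tr and Ev partitions and a clipping step for the density model's predicted probabilities. The following is a minimal sketch of those two steps, not the authors' code (which is in the linked repository). The function names, the fixed seed, the two-sided clipping to [eps, 1 - eps], the default threshold of 1e-6, and the odds-style weight p / (1 - p) are all assumptions made for illustration.

```python
import random


def split_tr_ev(n, train_frac=0.8, seed=0):
    """Randomly partition indices 0..n-1 into training (Tr) and evaluation (Ev) lists.

    Hypothetical helper: the 80/20 fraction comes from the quoted text,
    the seed and the rounding rule are assumptions.
    """
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    cut = int(round(train_frac * n))
    return idx[:cut], idx[cut:]


def clip_probs(probs, eps=1e-6):
    """Clip probabilities to [eps, 1 - eps] (two-sided clipping is an assumption)."""
    return [min(max(p, eps), 1.0 - eps) for p in probs]


def density_weights(probs, eps=1e-6):
    """Odds-style weights p / (1 - p) computed from clipped probabilities.

    The weight formula is an illustrative assumption; clipping keeps the
    denominator away from zero so the weights stay bounded.
    """
    return [p / (1.0 - p) for p in clip_probs(probs, eps)]


# Example: split a source sample of size 3000 and weight three predictions.
tr, ev = split_tr_ev(3000)
weights = density_weights([0.0, 0.5, 1.0])
```

Without clipping, a predicted probability of exactly 1 would make the weight undefined; with a threshold of 1e-6 the largest possible weight is roughly 10⁶.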