Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Conformal Inference under High-Dimensional Covariate Shifts via Likelihood-Ratio Regularization

Authors: Sunay Joshi, Shayan Kiyani, George J. Pappas, Edgar Dobriban, Hamed Hassani

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Our experiments demonstrate that the LR-QR algorithm outperforms existing methods on high-dimensional prediction tasks, including a regression task for the Communities and Crime dataset, an image classification task from the WILDS repository, and an LLM question-answering task on the MMLU benchmark. [...] 5 Experiments We compare our method with the following baselines: [...] Results. Figure 1 displays the results.
Researcher Affiliation	Academia	Sunay Joshi University of Pennsylvania EMAIL Shayan Kiyani University of Pennsylvania EMAIL George Pappas University of Pennsylvania EMAIL Edgar Dobriban University of Pennsylvania EMAIL Hamed Hassani University of Pennsylvania EMAIL
Pseudocode	Yes	Algorithm 1 Likelihood-ratio regularized quantile regression
Open Source Code	No	The software package implementing our method and reproducing the experiments will be released as an open-source GitHub repository upon publication. The datasets used are publicly available.
Open Datasets	Yes	Our experiments demonstrate that the LR-QR algorithm outperforms existing methods on high-dimensional prediction tasks, including a regression task for the Communities and Crime dataset, an image classification task from the WILDS repository, and an LLM question-answering task on the MMLU benchmark. [...] 5.2 Communities and Crime We evaluate our methods on the Communities and Crime dataset [42] [...] 5.3 RxRx1 data WILDS Our next experiment uses the RxRx1 dataset [53] from the WILDS repository [25] [...] 5.4 Multiple choice questions MMLU Finally, we evaluate all methods using the MMLU benchmark
Dataset Splits	Yes	5.1 Choosing the Regularization Parameter [...] We then perform three-fold cross-validation over the combined calibration and unlabeled target datasets (without using any labeled test data) as follows: [...] 5.2 Communities and Crime [...] We first randomly select half of the data as a training set, and use it to fit a ridge regression model f̂ as our predictor. [...] We then further split the target set into roughly equal unlabeled and labeled subsets. [...] 5.3 RxRx1 data WILDS [...] each experiment is selected as the target dataset, and its data is evenly split into an unlabeled target set and a labeled test set.
Hardware Specification	No	The paper does not provide specific hardware details (exact GPU/CPU models, processor types with speeds, memory amounts, or detailed computer specifications) used for running its experiments. It only generally states 'We perform small-scale experiments and provide sufficient detail on the set-up.'
Software Dependencies	No	The paper mentions machine learning models (e.g., ridge regression, ResNet50, Llama 13B) and concepts like logistic regression but does not provide specific version numbers for software libraries, frameworks, or programming languages used (e.g., Python 3.x, PyTorch 1.x, TensorFlow 2.x).
Experiment Setup	Yes	5.1 Choosing the Regularization Parameter [...] we form a uniform grid of size ten from λ/10 to λ. We then perform three-fold cross-validation over the combined calibration and unlabeled target datasets [...] We pick λ with the smallest average validation measure across all folds. [...] 5.2 Communities and Crime We first randomly select half of the data as a training set, and use it to fit a ridge regression model f̂ as our predictor. We tune the ridge regularization with five-fold cross-validation. [...] 5.3 RxRx1 data WILDS We use a ResNet50 model [20] trained by the WILDS authors on 37 of the 51 experiments. [...] 5.4 Multiple choice questions MMLU [...] we follow a prompt-based scoring scheme adapted for LLMs: we append the string "The answer is the option:" to the end of each MMLU question and feed the resulting prompt into the Llama 13B model without generating any output. We then extract the next-token logits [...] and consider the logits associated with the characters A, B, C, and D. These four logits are normalized using the softmax function to produce a probability vector over the answer options.
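
Illustrative sketch (not the authors' code): two steps quoted in the Experiment Setup row can be written in a few lines of Python. The first function applies a softmax to the next-token logits of the four answer characters only, as described for the MMLU experiment. The second builds a uniform grid of ten values between one tenth of an upper endpoint and the endpoint itself, as described for the regularization parameter. The function names, the dictionary input format, and the parameter `upper` are hypothetical; the extracted text does not say how the upper endpoint is chosen, so it is left as an argument.

```python
import math


def option_probabilities(logits_by_token, options=("A", "B", "C", "D")):
    """Softmax over the logits of the answer options only (hypothetical helper).

    `logits_by_token` maps a token string to its next-token logit; how these
    logits are obtained from the model is outside this sketch.
    """
    selected = [logits_by_token[o] for o in options]
    peak = max(selected)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in selected]
    total = sum(exps)
    return {o: e / total for o, e in zip(options, exps)}


def uniform_grid(upper, size=10):
    """Evenly spaced grid of `size` values from upper / 10 to upper, inclusive."""
    lower = upper / 10
    step = (upper - lower) / (size - 1)
    return [lower + i * step for i in range(size)]


# Made-up logits; tokens other than A-D are ignored by the normalization.
probs = option_probabilities({"A": 2.0, "B": 1.0, "C": 0.5, "D": -1.0, "E": 9.0})
grid = uniform_grid(1.0)
```

Restricting the softmax to the four option tokens means the resulting vector sums to one over the answer choices, regardless of how much probability mass the model places on other tokens.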