Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
WATCH: Adaptive Monitoring for AI Deployments via Weighted-Conformal Martingales
Authors: Drew Prinster, Xing Han, Anqi Liu, Suchi Saria
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | On real-world datasets, we demonstrate improved performance relative to state-of-the-art baselines. We conduct a comprehensive empirical analysis of the WATCH framework on real-world datasets with various distribution shifts. Our results show that WATCH adapts effectively to benign shifts (Section 4.1) and triggers alarms when it fails to adapt (Section 4.2), while also quickly detecting harmful shifts with little delay (Section 4.3). Details on the datasets, models, and additional results, can be found in Appendix E. Code to reproduce all experiments is available at the following repository: https://github.com/aaronhan223/watch. |
| Researcher Affiliation | Academia | Drew Prinster 1 Xing Han 1 Anqi Liu 1 Suchi Saria 1 1Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA. Correspondence to: Drew Prinster <EMAIL>, Xing Han <EMAIL>. |
| Pseudocode | Yes | Only limited algorithm pseudocode is provided at this time, focused on algorithms that explain how we penalized noninformative conformal prediction intervals (i.e., with anticonservative p-values). We aim to update the arXiv version of this paper with more comprehensive pseudocode. Algorithm 1 Calculate weighted conformal prediction set for covariate shift (Tibshirani et al., 2019). Algorithm 2 Calculate weighted conformal p-value that penalizes noninformativeness. |
| Open Source Code | Yes | Code to reproduce all experiments is available at the following repository: https://github.com/aaronhan223/watch. |
| Open Datasets | Yes | The tabular datasets are for regression tasks (where conformal methods used the absolute-residual nonconformity score Ŝ(x, y) = \|y − µ̂(x)\|), and the datasets span various sizes and dimensionalities: the Medical Expenditure Panel Survey (MEPS) dataset (33005 samples, 107 features) (Cohen et al., 2009), the UCI Superconductivity dataset (21263 samples, 81 features) (Hamidieh, 2018), and the UCI bike sharing dataset (17379 samples, 12 features) (Fanaee-T, 2013). The image datasets, used for classification tasks (where conformal methods used the one-minus-softmax score Ŝ(x, y) = 1 − p̂(yᵢ \| xᵢ)), were MNIST-corruption (Mu and Gilmer, 2019) (60000 clean samples, 10000 corrupted samples) and CIFAR-10-corruption (Hendrycks and Dietterich, 2019) (50000 clean samples, 10000 corrupted samples), which are standard benchmarks for assessing distribution shifts. |
| Dataset Splits | Yes | Training and calibration sets were sampled uniformly at random (with 1/3 of the total data used for training and calibration each), while post-changepoint test-set datapoints were bias-sampled from the remaining holdout data with probability proportional to exp(λ h(x)). |
| Hardware Specification | No | The paper mentions using "a neural network" or "MLPRegressor" and "ResNet-32" as underlying ML predictors, but does not specify any particular hardware (e.g., GPU models, CPU types, or memory) used to run the experiments or train these models. |
| Software Dependencies | No | On the tabular data, we used the scikit-learn (Pedregosa et al., 2011) MLPRegressor (with L-BFGS solver and logistic activation); for the image data, we used a 3-layer MLP with ReLU activations on the MNIST datasets and a ResNet-32 (He et al., 2016) on CIFAR-10 datasets. For weight estimation, we use a 3-layer MLP with ReLU activations to distinguish between source and target distributions. The paper mentions software like "scikit-learn" but does not provide specific version numbers for any software dependencies. |
| Experiment Setup | Yes | In all experiments and for all baselines, the underlying ML predictor being monitored was a neural network. On the tabular data, we used the scikit-learn (Pedregosa et al., 2011) MLPRegressor (with L-BFGS solver and logistic activation); for the image data, we used a 3-layer MLP with ReLU activations on the MNIST datasets and a ResNet-32 (He et al., 2016) on CIFAR-10 datasets. For weight estimation, we use a 3-layer MLP with ReLU activations to distinguish between source and target distributions. Details of model architectures and training configurations can be found in Tables 4–7. (Referring to Appendix E.3 and Tables 4–7, which specify training epochs (30), batch size (64), learning rate (0.001), optimizer (Adam), dropout rate (0.3), and initial temperature for scaling (1.5) for the models used). |
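
The nonconformity scores quoted in the Open Datasets row can be sketched in a few lines. This is an illustration of the standard score definitions, not the paper's implementation; the function names are hypothetical.

```python
import numpy as np

def absolute_residual_score(y, mu_hat):
    """Regression nonconformity score: S(x, y) = |y - mu_hat(x)|,
    where mu_hat is the model's point prediction for each example."""
    return np.abs(np.asarray(y, dtype=float) - np.asarray(mu_hat, dtype=float))

def one_minus_softmax_score(probs, labels):
    """Classification nonconformity score: S(x, y) = 1 - p_hat(y | x),
    where probs[i] is the model's softmax vector for example i and
    labels[i] is the integer class label y_i."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=int)
    return 1.0 - probs[np.arange(len(labels)), labels]
```

Higher scores indicate examples the model fits worse, which is what the conformal machinery ranks against the calibration set.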
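
The Pseudocode row references a weighted conformal p-value (Algorithm 2). A minimal sketch of the generic weighted-conformal p-value in the style of Tibshirani et al. (2019) is below; it does not include the paper's noninformativeness penalty, and the function name and argument layout are assumptions.

```python
import numpy as np

def weighted_conformal_pvalue(cal_scores, cal_weights, test_score, test_weight):
    """Weighted conformal p-value: the (likelihood-ratio-weighted) fraction of
    calibration scores at least as extreme as the test score, with the test
    point contributing its own weight. Reduces to (k + 1) / (n + 1) when all
    weights are equal (the exchangeable case)."""
    cal_scores = np.asarray(cal_scores, dtype=float)
    cal_weights = np.asarray(cal_weights, dtype=float)
    total = cal_weights.sum() + test_weight
    # Weighted mass of calibration points with score >= test score,
    # plus the test point's own mass.
    mass = cal_weights[cal_scores >= test_score].sum() + test_weight
    return mass / total
```

With uniform weights and four calibration scores, a test score between the second and third scores yields p = 3/5, matching the unweighted conformal count.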
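
The Dataset Splits row describes exponential-tilting bias sampling with probability proportional to exp(λ h(x)). A sketch of that sampling step is below; the tilting function `h` and the helper name are hypothetical, since the paper's exact choice of h is dataset-specific.

```python
import numpy as np

def bias_sample(X, h, lam, n_samples, rng=None):
    """Draw n_samples indices from X without replacement, with probability
    proportional to exp(lam * h(x)). This induces a covariate shift on the
    holdout data; h is a user-chosen tilting function."""
    rng = np.random.default_rng(rng)
    logits = lam * np.array([h(x) for x in X], dtype=float)
    probs = np.exp(logits - logits.max())  # subtract max for numerical stability
    probs /= probs.sum()
    return rng.choice(len(X), size=n_samples, replace=False, p=probs)
```

Setting lam = 0 recovers uniform sampling; larger |lam| concentrates the test set on regions where h is large (or small), making the shift more severe.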
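
The tabular predictor quoted in the Experiment Setup row (scikit-learn MLPRegressor with L-BFGS solver and logistic activation) can be instantiated as below. The synthetic data, hidden-layer size, and iteration cap are illustrative assumptions; the paper's Tables 4–7 give the actual configurations.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Synthetic regression data, for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=200)

# Predictor as quoted in the setup: MLPRegressor with L-BFGS solver
# and logistic activation; remaining arguments are assumptions.
model = MLPRegressor(solver="lbfgs", activation="logistic",
                     hidden_layer_sizes=(50,), max_iter=500, random_state=0)
model.fit(X, y)
preds = model.predict(X)
```

The fitted model's point predictions are exactly the µ̂(x) plugged into the absolute-residual nonconformity score used on the tabular datasets.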