Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
WATCH: Adaptive Monitoring for AI Deployments via Weighted-Conformal Martingales
Authors: Drew Prinster, Xing Han, Anqi Liu, Suchi Saria
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | On real-world datasets, we demonstrate improved performance relative to state-of-the-art baselines. We conduct a comprehensive empirical analysis of the WATCH framework on real-world datasets with various distribution shifts. Our results show that WATCH adapts effectively to benign shifts (Section 4.1) and triggers alarms when it fails to adapt (Section 4.2), while also quickly detecting harmful shifts with little delay (Section 4.3). Details on the datasets, models, and additional results, can be found in Appendix E. Code to reproduce all experiments is available at the following repository: https://github.com/aaronhan223/watch. |
| Researcher Affiliation | Academia | Drew Prinster 1 Xing Han 1 Anqi Liu 1 Suchi Saria 1 1Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA. Correspondence to: Drew Prinster <EMAIL>, Xing Han <EMAIL>. |
| Pseudocode | Yes | Only limited algorithm pseudocode is provided at this time, focused on algorithms that explain how we penalized noninformative conformal prediction intervals (i.e., with anticonservative p-values). We aim to update the arXiv version of this paper with more comprehensive pseudocode. Algorithm 1 Calculate weighted conformal prediction set for covariate shift (Tibshirani et al., 2019). Algorithm 2 Calculate weighted conformal p-value that penalizes noninformativeness. |
| Open Source Code | Yes | Code to reproduce all experiments is available at the following repository: https://github.com/aaronhan223/watch. |
| Open Datasets | Yes | The tabular datasets are for regression tasks (where conformal methods used the absolute-residual nonconformity score Ŝ(x, y) = \|y − µ̂(x)\|), and the datasets span various sizes and dimensionalities: the Medical Expenditure Panel Survey (MEPS) dataset (33005 samples, 107 features) (Cohen et al., 2009), the UCI Superconductivity dataset (21263 samples, 81 features) (Hamidieh, 2018), and the UCI bike sharing dataset (17379 samples, 12 features) (Fanaee-T, 2013). The image datasets, used for classification tasks (where conformal methods used the one-minus-softmax score Ŝ(x, y) = 1 − p̂(yᵢ \| xᵢ)), were MNIST-corruption (Mu and Gilmer, 2019) (60000 clean samples, 10000 corrupted samples) and CIFAR-10-corruption (Hendrycks and Dietterich, 2019) (50000 clean samples, 10000 corrupted samples), which are standard benchmarks for assessing distribution shifts. |
| Dataset Splits | Yes | Training and calibration sets were sampled uniformly at random (with 1/3 of the total data used for training and calibration each), while post-changepoint test-set datapoints were bias-sampled from the remaining holdout data with probability proportional to exp(λ h(x)). |
| Hardware Specification | No | The paper mentions using "a neural network" or "MLPRegressor" and "ResNet-32" as underlying ML predictors, but does not specify any particular hardware (e.g., GPU models, CPU types, or memory) used to run the experiments or train these models. |
| Software Dependencies | No | On the tabular data, we used the scikit-learn (Pedregosa et al., 2011) MLPRegressor (with L-BFGS solver and logistic activation); for the image data, we used a 3-layer MLP with ReLU activations on the MNIST datasets and a ResNet-32 (He et al., 2016) on CIFAR-10 datasets. For weight estimation, we use a 3-layer MLP with ReLU activations to distinguish between source and target distributions. The paper mentions software like "scikit-learn" but does not provide specific version numbers for any software dependencies. |
| Experiment Setup | Yes | In all experiments and for all baselines, the underlying ML predictor being monitored was a neural network. On the tabular data, we used the scikit-learn (Pedregosa et al., 2011) MLPRegressor (with L-BFGS solver and logistic activation); for the image data, we used a 3-layer MLP with ReLU activations on the MNIST datasets and a ResNet-32 (He et al., 2016) on CIFAR-10 datasets. For weight estimation, we use a 3-layer MLP with ReLU activations to distinguish between source and target distributions. Details of model architectures and training configurations can be found in Tables 4–7. (Referring to Appendix E.3 and Tables 4–7, which specify training epochs (30), batch size (64), learning rate (0.001), optimizer (Adam), dropout rate (0.3), and initial temperature for scaling (1.5) for the models used). |
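
The nonconformity scores quoted in the Open Datasets row can be sketched in a few lines. This is an illustration of the standard score definitions, not the paper's implementation; the function names are hypothetical.

```python
import numpy as np

def absolute_residual_score(y, mu_hat):
    """Regression nonconformity score: S(x, y) = |y - mu_hat(x)|,
    where mu_hat is the model's point prediction for each example."""
    return np.abs(np.asarray(y, dtype=float) - np.asarray(mu_hat, dtype=float))

def one_minus_softmax_score(probs, labels):
    """Classification nonconformity score: S(x, y) = 1 - p_hat(y | x),
    where probs[i] is the model's softmax vector for example i and
    labels[i] is the integer class label y_i."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=int)
    return 1.0 - probs[np.arange(len(labels)), labels]
```

Higher scores indicate examples the model fits worse, which is what the conformal machinery ranks against the calibration set.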
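
The Pseudocode row references a weighted conformal p-value (Algorithm 2). A minimal sketch of the generic weighted-conformal p-value in the style of Tibshirani et al. (2019) is below; it does not include the paper's noninformativeness penalty, and the function name and argument layout are assumptions.

```python
import numpy as np

def weighted_conformal_pvalue(cal_scores, cal_weights, test_score, test_weight):
    """Weighted conformal p-value: the (likelihood-ratio-weighted) fraction of
    calibration scores at least as extreme as the test score, with the test
    point contributing its own weight. Reduces to (k + 1) / (n + 1) when all
    weights are equal (the exchangeable case)."""
    cal_scores = np.asarray(cal_scores, dtype=float)
    cal_weights = np.asarray(cal_weights, dtype=float)
    total = cal_weights.sum() + test_weight
    # Weighted mass of calibration points with score >= test score,
    # plus the test point's own mass.
    mass = cal_weights[cal_scores >= test_score].sum() + test_weight
    return mass / total
```

With uniform weights and four calibration scores, a test score between the second and third scores yields p = 3/5, matching the unweighted conformal count.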
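
The Dataset Splits row describes exponential-tilting bias sampling with probability proportional to exp(λ h(x)). A sketch of that sampling step is below; the tilting function `h` and the helper name are hypothetical, since the paper's exact choice of h is dataset-specific.

```python
import numpy as np

def bias_sample(X, h, lam, n_samples, rng=None):
    """Draw n_samples indices from X without replacement, with probability
    proportional to exp(lam * h(x)). This induces a covariate shift on the
    holdout data; h is a user-chosen tilting function."""
    rng = np.random.default_rng(rng)
    logits = lam * np.array([h(x) for x in X], dtype=float)
    probs = np.exp(logits - logits.max())  # subtract max for numerical stability
    probs /= probs.sum()
    return rng.choice(len(X), size=n_samples, replace=False, p=probs)
```

Setting lam = 0 recovers uniform sampling; larger |lam| concentrates the test set on regions where h is large (or small), making the shift more severe.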
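
The tabular predictor quoted in the Experiment Setup row (scikit-learn MLPRegressor with L-BFGS solver and logistic activation) can be instantiated as below. The synthetic data, hidden-layer size, and iteration cap are illustrative assumptions; the paper's Tables 4–7 give the actual configurations.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Synthetic regression data, for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=200)

# Predictor as quoted in the setup: MLPRegressor with L-BFGS solver
# and logistic activation; remaining arguments are assumptions.
model = MLPRegressor(solver="lbfgs", activation="logistic",
                     hidden_layer_sizes=(50,), max_iter=500, random_state=0)
model.fit(X, y)
preds = model.predict(X)
```

The fitted model's point predictions are exactly the µ̂(x) plugged into the absolute-residual nonconformity score used on the tabular datasets.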