Sequential Harmful Shift Detection Without Labels
Authors: Salim I. Amoukou, Tom Bewley, Saumitra Mishra, Freddy Lecue, Daniele Magazzeni, Manuela Veloso
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments show that our method has high power and false alarm control under various distribution shifts, including covariate and label shifts and natural shifts over geography and time. Section 5 demonstrates the empirical efficacy of our method, showcasing its strong detection capabilities and controlled false alarm rates across various types of harmful shift. |
| Researcher Affiliation | Industry | Salim I. Amoukou, Tom Bewley, Saumitra Mishra, Freddy Lecue, Daniele Magazzeni, Manuela Veloso. J.P. Morgan AI Research. Correspondence to: Salim I. Amoukou <salim.ibrahimamoukou@jpmorgan.com> |
| Pseudocode | No | The paper describes its methods using prose and mathematical equations but does not include any structured pseudocode blocks or algorithms. |
| Open Source Code | No | We will also release the code with a proper readme to use the methods. |
| Open Datasets | Yes | using the California house prices [Dua and Graff, 2017], Bike sharing demand [Fanaee-T, 2013], HELOC [FICO, 2018] and NHANES [CDC, 1999-2022] datasets. We partition each dataset into training (60%), test (20%) and calibration (20%) sets and use the training data to train random forests (RFs) as the primary models. |
| Dataset Splits | Yes | We partition each dataset into training (60%), test (20%) and calibration (20%) sets and use the training data to train random forests (RFs) as the primary models. We split this dataset into a training set (60%), test set (20%) and calibration set (20%), and train a ResNet50 on the training set. Using half of the calibration set, we train another ResNet50 (with a regression head) as an error estimator. The remaining half is employed to determine the empirical quantiles p ∈ [0.5, 1), p̂ ∈ (0, 1) at which we achieve maximum power while keeping the FDP below 0.2. |
| Hardware Specification | Yes | We run all our experiments on an Amazon EC2 instance (c5.4xlarge) that consists of 16 vCPUs and 32 GB of RAM. |
| Software Dependencies | No | The paper mentions software components like 'random forests (RFs)' and 'ResNet50 model' but does not provide specific version numbers for any libraries or frameworks (e.g., scikit-learn version, PyTorch/TensorFlow version). |
| Experiment Setup | Yes | For continuous features, we exclude 80% of observations with values either above or below the median. For categorical features, we exclude data from one category. We use half of the calibration sets to train RF regressors as the error estimators, then use the remainder to calibrate true and estimated error thresholds using the grid search process described above. We consider a shift to be harmful if the model's error in production exceeds the error on the calibration dataset plus ϵtol = 0. |
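The data-splitting and covariate-shift simulation described in the table can be sketched as follows. This is a minimal illustration assuming NumPy arrays; the function names (`split_dataset`, `simulate_covariate_shift`) and the `seed`/`frac` parameters are hypothetical, not from the paper, which only specifies the 60/20/20 split and the rule of excluding 80% of observations above (or below) the median of a continuous feature.

```python
import numpy as np

def split_dataset(n, seed=0):
    """Random 60% train / 20% test / 20% calibration index split,
    matching the partition proportions quoted in the paper."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)
    n_train, n_test = int(0.6 * n), int(0.2 * n)
    return (idx[:n_train],
            idx[n_train:n_train + n_test],
            idx[n_train + n_test:])

def simulate_covariate_shift(X, feature, rng, upper=True, frac=0.8):
    """Simulate a covariate shift on a continuous feature by dropping
    `frac` (80% in the paper) of the rows whose value for `feature`
    lies above (or below) the median."""
    median = np.median(X[:, feature])
    mask = X[:, feature] > median if upper else X[:, feature] < median
    candidates = np.where(mask)[0]
    drop = rng.choice(candidates, size=int(frac * len(candidates)),
                      replace=False)
    keep = np.setdiff1d(np.arange(len(X)), drop)
    return X[keep]
```

For categorical features, the analogous shift would instead drop all rows belonging to one category.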