Failing Loudly: An Empirical Study of Methods for Detecting Dataset Shift

Authors: Stephan Rabanser, Stephan Günnemann, Zachary Lipton

NeurIPS 2019

Each entry below pairs a reproducibility variable with its assessed result and the supporting LLM response.
Research Type: Experimental
LLM Response: This paper explores the problem of building ML systems that fail loudly, investigating methods for detecting dataset shift, identifying exemplars that most typify the shift, and quantifying shift malignancy. We focus on several datasets and various perturbations to both covariates and label distributions with varying magnitudes and fractions of data affected.
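The perturbation protocol described above sweeps noise magnitude and the fraction of data affected. A minimal sketch of one such covariate perturbation, assuming Gaussian noise; the function name, noise scale, and defaults are illustrative, not the paper's exact settings:

```python
import numpy as np

def perturb_covariates(x_test, noise_std=1.0, fraction=0.5, seed=0):
    """Apply Gaussian noise to a random fraction of test inputs.

    A hypothetical example of the covariate perturbations the study
    sweeps over: `noise_std` controls the magnitude of the shift and
    `fraction` controls how much of the data is affected.
    """
    rng = np.random.default_rng(seed)
    x_shifted = x_test.copy().astype(np.float32)
    n_affected = int(fraction * len(x_shifted))
    idx = rng.choice(len(x_shifted), size=n_affected, replace=False)
    x_shifted[idx] += rng.normal(0.0, noise_std, size=x_shifted[idx].shape)
    return x_shifted
```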
Researcher Affiliation: Collaboration
LLM Response: Stephan Rabanser (AWS AI Labs, rabans@amazon.com); Stephan Günnemann (Technical University of Munich, guennemann@in.tum.de); Zachary C. Lipton (Carnegie Mellon University, zlipton@cmu.edu)
Pseudocode: No
LLM Response: The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code: Yes
LLM Response: We provide a sample implementation of our experiments-pipeline written in Python, making use of sklearn [36] and Keras [11], located at: https://github.com/steverab/failing-loudly.
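The released pipeline combines dimensionality reduction with statistical two-sample tests to flag shift. A minimal sketch of that reduce-then-test pattern, using sklearn PCA and per-dimension Kolmogorov-Smirnov tests with a Bonferroni correction; the choice of reduction, component count, and significance level here are illustrative assumptions:

```python
import numpy as np
from scipy.stats import ks_2samp
from sklearn.decomposition import PCA

def detect_shift(x_source, x_target, n_components=32, alpha=0.05):
    """Flag dataset shift via per-dimension KS tests on reduced features.

    Expects flattened inputs of shape (n_samples, n_features). Fits a
    shared reduction (PCA here) on the source sample, projects both
    samples, runs a two-sample KS test per latent dimension, and
    Bonferroni-corrects for multiple testing.
    """
    pca = PCA(n_components=n_components).fit(x_source)
    z_s, z_t = pca.transform(x_source), pca.transform(x_target)
    p_values = [ks_2samp(z_s[:, d], z_t[:, d]).pvalue
                for d in range(n_components)]
    # Bonferroni: shift is detected if any p-value falls below alpha / K.
    return min(p_values) < alpha / n_components, p_values
```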
Open Datasets: Yes
LLM Response: Our main experiments were carried out on the MNIST (Ntr = 50000; Nval = 10000; Nte = 10000; D = 28×28×1; C = 10 classes) [25] and CIFAR-10 (Ntr = 40000; Nval = 10000; Nte = 10000; D = 32×32×3; C = 10 classes) [23] image datasets.
Dataset Splits: Yes
LLM Response: Our main experiments were carried out on the MNIST (Ntr = 50000; Nval = 10000; Nte = 10000; D = 28×28×1; C = 10 classes) [25] and CIFAR-10 (Ntr = 40000; Nval = 10000; Nte = 10000; D = 32×32×3; C = 10 classes) [23] image datasets.
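The quoted split sizes are consistent with the standard Keras dataset loaders once a validation set is held out of the training portion. A sketch under that assumption (splitting by index order is itself an assumption):

```python
from tensorflow.keras.datasets import mnist, cifar10

# MNIST ships as 60,000 train / 10,000 test; holding out the last 10,000
# training images as validation yields the quoted 50,000/10,000/10,000 split.
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_tr, y_tr = x_train[:50000], y_train[:50000]
x_val, y_val = x_train[50000:], y_train[50000:]

# CIFAR-10 ships as 50,000 train / 10,000 test, matching the quoted
# 40,000/10,000/10,000 split after the same hold-out.
(x_train_c, y_train_c), (x_test_c, y_test_c) = cifar10.load_data()
x_tr_c, y_tr_c = x_train_c[:40000], y_train_c[:40000]
x_val_c, y_val_c = x_train_c[40000:], y_train_c[40000:]
```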
Hardware Specification: No
LLM Response: The paper does not provide specific hardware details (e.g., GPU models, CPU types, or memory amounts) used for running the experiments; it only implies the software environment through its use of libraries such as Keras.
Software Dependencies: No
LLM Response: The paper mentions "Python, making use of sklearn [36] and Keras [11]" but does not specify version numbers for these software components, which are required for reproducibility.
Experiment Setup: Yes
LLM Response: We train all networks (TAE, BBSDs, BBSDh, Classif) using stochastic gradient descent with momentum in batches of 128 examples over 200 epochs with early stopping.
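The quoted optimization settings map directly onto a Keras training call. A minimal sketch reusing the MNIST split from the loader sketch above; the architecture, learning rate, momentum value, and early-stopping patience are assumed, as the quote does not specify them:

```python
from tensorflow.keras import layers, models, optimizers, callbacks

# A placeholder classifier; the paper trains several networks (TAE, BBSDs,
# BBSDh, Classif) whose architectures are not given in the quote above.
model = models.Sequential([
    layers.Flatten(input_shape=(28, 28)),
    layers.Dense(256, activation="relu"),
    layers.Dense(10, activation="softmax"),
])

# SGD with momentum, batches of 128, up to 200 epochs with early stopping,
# matching the quoted setup; learning_rate=0.01 and momentum=0.9 are
# assumed values, not taken from the paper.
model.compile(
    optimizer=optimizers.SGD(learning_rate=0.01, momentum=0.9),
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)
early_stop = callbacks.EarlyStopping(monitor="val_loss", patience=10,
                                     restore_best_weights=True)
model.fit(x_tr / 255.0, y_tr,
          validation_data=(x_val / 255.0, y_val),
          batch_size=128, epochs=200, callbacks=[early_stop])
```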