Identifying Mislabeled Data using the Area Under the Margin Ranking
Authors: Geoff Pleiss, Tianyi Zhang, Ethan Elenberg, Kilian Q. Weinberger
NeurIPS 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We test the efficacy of AUM and threshold samples in two ways. First, we directly measure the precision and recall of our identification procedure on synthetic noisy datasets. Second, we train models on noisy datasets after removing the identified data. We use test error as a proxy for identification performance: removing mislabeled samples should improve accuracy, whereas removing correctly-labeled samples should hurt accuracy. In all experiments we do not assume the presence of any trusted data for training or validation. (A toy sketch of this evaluation protocol appears after the table.) |
| Researcher Affiliation | Collaboration | Geoff Pleiss (Columbia University, gmp2162@columbia.edu); Tianyi Zhang (Stanford University, tz58@stanford.edu); Ethan Elenberg (ASAPP, eelenberg@asapp.com); Kilian Q. Weinberger (ASAPP, Cornell University) |
| Pseudocode | Yes | Putting this all together, we propose the following procedure for identifying mislabeled data: 1. Create a subset D_THR of threshold samples. 2. Construct a modified training set D'_train that includes the threshold samples. 3. Train a network on D'_train until the first learning rate drop, measuring the AUM of all data. 4. Compute α: the 99th percentile threshold sample AUM. 5. Identify mislabeled data using α as a threshold: {(x, y) ∈ (D_train \ D_THR) : AUM_{x,y} ≤ α}. (See the procedure sketch after the table.) |
| Open Source Code | Yes | Our package (pip install aum) can be used with any PyTorch classification model. [...] We provide a simple package (pip install aum) that computes AUM for any PyTorch classifier. (A hedged usage sketch follows the table.) |
| Open Datasets | Yes | We use synthetically-mislabeled versions of CIFAR10/100 [30], where subsets of 45,000 images are used for training. We also consider Tiny ImageNet, a 200-class subset of ImageNet [13] with 95,000 images resized to 64×64. WebVision [35] contains 2 million images... Clothing1M [60] contains clothing images... |
| Dataset Splits | No | In all experiments we do not assume the presence of any trusted data for training or validation. [...] We do not perform early stopping since we do not assume the presence of a clean validation set. |
| Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU models, CPU types) used for running the experiments. |
| Software Dependencies | No | The paper mentions PyTorch and a 'pip install aum' package but does not provide specific version numbers for PyTorch or any other software dependencies, which are required for reproducibility. |
| Experiment Setup | Yes | For our method (AUM), as well as the BMM and GMM methods, we train networks for 150 epochs with no learning rate drops. [...] In all the above experiments, we train networks and compute AUM values using standard data augmentation (random image flips and crops). |
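
The "Research Type" row quotes the paper's evaluation protocol: inject synthetic label noise, run the identification procedure, and score it with precision and recall against the known corruptions. The sketch below illustrates that protocol under assumptions of ours: the uniform (symmetric) flip noise model is only one of the noise types the paper considers, and `inject_symmetric_noise` / `precision_recall` are hypothetical helper names, not the authors' code.

```python
# Toy sketch of the evaluation protocol: corrupt labels synthetically,
# then score a boolean "flagged as mislabeled" mask against ground truth.
import torch


def inject_symmetric_noise(labels: torch.Tensor, num_classes: int, frac: float,
                           seed: int = 0) -> tuple[torch.Tensor, torch.Tensor]:
    """Flip a random `frac` of labels to a different, uniformly-chosen class."""
    gen = torch.Generator().manual_seed(seed)
    n = labels.numel()
    flip = torch.randperm(n, generator=gen)[: int(frac * n)]
    noisy = labels.clone()
    # An offset in [1, num_classes-1] guarantees the new label differs from the old.
    offsets = torch.randint(1, num_classes, (flip.numel(),), generator=gen)
    noisy[flip] = (labels[flip] + offsets) % num_classes
    is_mislabeled = torch.zeros(n, dtype=torch.bool)
    is_mislabeled[flip] = True
    return noisy, is_mislabeled


def precision_recall(flagged: torch.Tensor, is_mislabeled: torch.Tensor):
    """Score a boolean mask of flagged samples against the known corruptions."""
    tp = (flagged & is_mislabeled).sum().item()
    precision = tp / max(flagged.sum().item(), 1)
    recall = tp / max(is_mislabeled.sum().item(), 1)
    return precision, recall
```

The `flagged` mask can come from any identification method, e.g. the `identify_mislabeled` sketch below.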
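The following is a minimal, self-contained sketch of the five-step procedure quoted in the "Pseudocode" row. Everything beyond those five steps is an illustrative assumption, not the authors' reference implementation: the `model_fn` interface, the SGD hyperparameters, and recording margins from the training-time logits rather than a separate forward pass.

```python
import torch
import torch.nn.functional as F


def margin(logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    """Assigned-class logit minus the largest other logit (the AUM summand)."""
    assigned = logits.gather(1, targets.unsqueeze(1)).squeeze(1)
    others = logits.scatter(1, targets.unsqueeze(1), float("-inf"))
    return assigned - others.max(dim=1).values


def identify_mislabeled(train_set, model_fn, num_classes, epochs, device="cpu"):
    """Return a boolean mask over train_set flagging likely-mislabeled samples.

    Assumes train_set yields (input, label, index) triples and that
    model_fn(k) builds a fresh classifier with k outputs.
    """
    n = len(train_set)
    # Step 1: reserve N/(c+1) random samples as threshold samples.
    is_thr = torch.zeros(n, dtype=torch.bool)
    is_thr[torch.randperm(n)[: n // (num_classes + 1)]] = True

    # Step 2: the modified training set relabels them to a fake extra class c.
    model = model_fn(num_classes + 1).to(device)
    opt = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9,
                          weight_decay=1e-4)
    loader = torch.utils.data.DataLoader(train_set, batch_size=256, shuffle=True)

    # Step 3: train until the first learning-rate drop, accumulating margins.
    aum_sum = torch.zeros(n)
    for _ in range(epochs):
        for x, y, idx in loader:
            y = torch.where(is_thr[idx], torch.full_like(y, num_classes), y)
            logits = model(x.to(device))
            loss = F.cross_entropy(logits, y.to(device))
            opt.zero_grad()
            loss.backward()
            opt.step()
            aum_sum[idx] += margin(logits.detach().cpu(), y)
    aum = aum_sum / epochs

    # Step 4: alpha is the 99th-percentile AUM among threshold samples.
    alpha = torch.quantile(aum[is_thr], 0.99)
    # Step 5: flag non-threshold samples whose AUM falls at or below alpha.
    return (~is_thr) & (aum <= alpha)
```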
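The "Open Source Code" row points to the authors' `pip install aum` package. The snippet below shows how such a per-sample AUM recorder slots into an ordinary PyTorch training loop on toy data. The `AUMCalculator` interface (`update()` per batch, `finalize()` at the end) follows our recollection of the package README; treat the exact signatures as assumptions and check the repository before relying on them.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from aum import AUMCalculator  # pip install aum

# Toy setup: a linear classifier on random data, with stable per-sample ids.
n, d, c = 1000, 32, 10
xs, ys = torch.randn(n, d), torch.randint(0, c, (n,))
loader = DataLoader(TensorDataset(xs, ys, torch.arange(n)), batch_size=128)
model = torch.nn.Linear(d, c)
opt = torch.optim.SGD(model.parameters(), lr=0.1)

calc = AUMCalculator(save_dir=".", compressed=True)
for epoch in range(5):
    for x, y, ids in loader:
        logits = model(x)
        loss = torch.nn.functional.cross_entropy(logits, y)
        opt.zero_grad()
        loss.backward()
        opt.step()
        # Record this batch's margins, keyed by per-sample ids.
        calc.update(logits.detach(), y, ids.tolist())
calc.finalize()  # writes the per-sample AUM records under save_dir
```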
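Finally, the "Experiment Setup" row specifies 150 epochs with no learning-rate drops and standard augmentation (random flips and crops). The sketch below configures that setup for CIFAR-sized inputs; the crop padding, normalization statistics, and optimizer hyperparameters are common defaults assumed here, not values quoted above.

```python
import torch
import torchvision.transforms as T

train_transform = T.Compose([
    T.RandomCrop(32, padding=4),   # random crops, as in the quoted setup
    T.RandomHorizontalFlip(),      # random flips
    T.ToTensor(),
    # CIFAR-10 channel statistics; an assumed default, not from the paper.
    T.Normalize((0.4914, 0.4822, 0.4465), (0.2470, 0.2435, 0.2616)),
])


def make_optimizer(model: torch.nn.Module) -> torch.optim.Optimizer:
    # Constant learning rate for all epochs: no scheduler, hence no LR drops.
    return torch.optim.SGD(model.parameters(), lr=0.1,
                           momentum=0.9, weight_decay=1e-4)


NUM_EPOCHS = 150  # per the quoted setup for AUM, BMM, and GMM
```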