Identifying Mislabeled Data using the Area Under the Margin Ranking
Authors: Geoff Pleiss, Tianyi Zhang, Ethan Elenberg, Kilian Q. Weinberger
NeurIPS 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We test the efficacy of AUM and threshold samples in two ways. First, we directly measure the precision and recall of our identification procedure on synthetic noisy datasets. Second, we train models on noisy datasets after removing the identified data. We use test error as a proxy for identification performance: removing mislabeled samples should improve accuracy, whereas removing correctly-labeled samples should hurt accuracy. In all experiments we do not assume the presence of any trusted data for training or validation. (A toy sketch of this evaluation protocol appears after the table.) |
| Researcher Affiliation | Collaboration | Geoff Pleiss (Columbia University, gmp2162@columbia.edu); Tianyi Zhang (Stanford University, tz58@stanford.edu); Ethan Elenberg (ASAPP, eelenberg@asapp.com); Kilian Q. Weinberger (ASAPP, Cornell University) |
| Pseudocode | Yes | Putting this all together, we propose the following procedure for identifying mislabeled data: 1. Create a subset D_THR of threshold samples. 2. Construct a modified training set D'_train that includes the threshold samples. 3. Train a network on D'_train until the first learning rate drop, measuring the AUM of all data. 4. Compute α: the 99th percentile threshold sample AUM. 5. Identify mislabeled data using α as a threshold: {(x, y) ∈ (D_train \ D_THR) : AUM_{x,y} ≤ α}. (See the procedure sketch after the table.) |
| Open Source Code | Yes | Our package (pip install aum) can be used with any PyTorch classification model. [...] We provide a simple package (pip install aum) that computes AUM for any PyTorch classifier. (A hedged usage sketch follows the table.) |
| Open Datasets | Yes | We use synthetically-mislabeled versions of CIFAR10/100 [30], where subsets of 45,000 images are used for training. We also consider Tiny ImageNet, a 200-class subset of ImageNet [13] with 95,000 images resized to 64×64. WebVision [35] contains 2 million images... Clothing1M [60] contains clothing images... |
| Dataset Splits | No | In all experiments we do not assume the presence of any trusted data for training or validation. [...] We do not perform early stopping since we do not assume the presence of a clean validation set. |
| Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU models, CPU types) used for running the experiments. |
| Software Dependencies | No | The paper mentions PyTorch and a 'pip install aum' package but does not provide specific version numbers for PyTorch or any other software dependencies, which are required for reproducibility. |
| Experiment Setup | Yes | For our method (AUM), as well as the BMM and GMM methods, we train networks for 150 epochs with no learning rate drops. [...] In all the above experiments, we train networks and compute AUM values using standard data augmentation (random image flips and crops). |
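
The "Research Type" row quotes the paper's evaluation protocol: inject synthetic label noise, run the identification procedure, and score it with precision and recall against the known corruptions. The sketch below illustrates that protocol under assumptions of ours: the uniform (symmetric) flip noise model is only one of the noise types the paper considers, and `inject_symmetric_noise` / `precision_recall` are hypothetical helper names, not the authors' code.

```python
# Toy sketch of the evaluation protocol: corrupt labels synthetically,
# then score a boolean "flagged as mislabeled" mask against ground truth.
import torch


def inject_symmetric_noise(labels: torch.Tensor, num_classes: int, frac: float,
                           seed: int = 0) -> tuple[torch.Tensor, torch.Tensor]:
    """Flip a random `frac` of labels to a different, uniformly-chosen class."""
    gen = torch.Generator().manual_seed(seed)
    n = labels.numel()
    flip = torch.randperm(n, generator=gen)[: int(frac * n)]
    noisy = labels.clone()
    # An offset in [1, num_classes-1] guarantees the new label differs from the old.
    offsets = torch.randint(1, num_classes, (flip.numel(),), generator=gen)
    noisy[flip] = (labels[flip] + offsets) % num_classes
    is_mislabeled = torch.zeros(n, dtype=torch.bool)
    is_mislabeled[flip] = True
    return noisy, is_mislabeled


def precision_recall(flagged: torch.Tensor, is_mislabeled: torch.Tensor):
    """Score a boolean mask of flagged samples against the known corruptions."""
    tp = (flagged & is_mislabeled).sum().item()
    precision = tp / max(flagged.sum().item(), 1)
    recall = tp / max(is_mislabeled.sum().item(), 1)
    return precision, recall
```

The `flagged` mask can come from any identification method, e.g. the `identify_mislabeled` sketch below.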
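The following is a minimal, self-contained sketch of the five-step procedure quoted in the "Pseudocode" row. Everything beyond those five steps is an illustrative assumption, not the authors' reference implementation: the `model_fn` interface, the SGD hyperparameters, and recording margins from the training-time logits rather than a separate forward pass.

```python
import torch
import torch.nn.functional as F


def margin(logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    """Assigned-class logit minus the largest other logit (the AUM summand)."""
    assigned = logits.gather(1, targets.unsqueeze(1)).squeeze(1)
    others = logits.scatter(1, targets.unsqueeze(1), float("-inf"))
    return assigned - others.max(dim=1).values


def identify_mislabeled(train_set, model_fn, num_classes, epochs, device="cpu"):
    """Return a boolean mask over train_set flagging likely-mislabeled samples.

    Assumes train_set yields (input, label, index) triples and that
    model_fn(k) builds a fresh classifier with k outputs.
    """
    n = len(train_set)
    # Step 1: reserve N/(c+1) random samples as threshold samples.
    is_thr = torch.zeros(n, dtype=torch.bool)
    is_thr[torch.randperm(n)[: n // (num_classes + 1)]] = True

    # Step 2: the modified training set relabels them to a fake extra class c.
    model = model_fn(num_classes + 1).to(device)
    opt = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9,
                          weight_decay=1e-4)
    loader = torch.utils.data.DataLoader(train_set, batch_size=256, shuffle=True)

    # Step 3: train until the first learning-rate drop, accumulating margins.
    aum_sum = torch.zeros(n)
    for _ in range(epochs):
        for x, y, idx in loader:
            y = torch.where(is_thr[idx], torch.full_like(y, num_classes), y)
            logits = model(x.to(device))
            loss = F.cross_entropy(logits, y.to(device))
            opt.zero_grad()
            loss.backward()
            opt.step()
            aum_sum[idx] += margin(logits.detach().cpu(), y)
    aum = aum_sum / epochs

    # Step 4: alpha is the 99th-percentile AUM among threshold samples.
    alpha = torch.quantile(aum[is_thr], 0.99)
    # Step 5: flag non-threshold samples whose AUM falls at or below alpha.
    return (~is_thr) & (aum <= alpha)
```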
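The "Open Source Code" row points to the authors' `pip install aum` package. The snippet below shows how such a per-sample AUM recorder slots into an ordinary PyTorch training loop on toy data. The `AUMCalculator` interface (`update()` per batch, `finalize()` at the end) follows our recollection of the package README; treat the exact signatures as assumptions and check the repository before relying on them.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from aum import AUMCalculator  # pip install aum

# Toy setup: a linear classifier on random data, with stable per-sample ids.
n, d, c = 1000, 32, 10
xs, ys = torch.randn(n, d), torch.randint(0, c, (n,))
loader = DataLoader(TensorDataset(xs, ys, torch.arange(n)), batch_size=128)
model = torch.nn.Linear(d, c)
opt = torch.optim.SGD(model.parameters(), lr=0.1)

calc = AUMCalculator(save_dir=".", compressed=True)
for epoch in range(5):
    for x, y, ids in loader:
        logits = model(x)
        loss = torch.nn.functional.cross_entropy(logits, y)
        opt.zero_grad()
        loss.backward()
        opt.step()
        # Record this batch's margins, keyed by per-sample ids.
        calc.update(logits.detach(), y, ids.tolist())
calc.finalize()  # writes the per-sample AUM records under save_dir
```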
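Finally, the "Experiment Setup" row specifies 150 epochs with no learning-rate drops and standard augmentation (random flips and crops). The sketch below configures that setup for CIFAR-sized inputs; the crop padding, normalization statistics, and optimizer hyperparameters are common defaults assumed here, not values quoted above.

```python
import torch
import torchvision.transforms as T

train_transform = T.Compose([
    T.RandomCrop(32, padding=4),   # random crops, as in the quoted setup
    T.RandomHorizontalFlip(),      # random flips
    T.ToTensor(),
    # CIFAR-10 channel statistics; an assumed default, not from the paper.
    T.Normalize((0.4914, 0.4822, 0.4465), (0.2470, 0.2435, 0.2616)),
])


def make_optimizer(model: torch.nn.Module) -> torch.optim.Optimizer:
    # Constant learning rate for all epochs: no scheduler, hence no LR drops.
    return torch.optim.SGD(model.parameters(), lr=0.1,
                           momentum=0.9, weight_decay=1e-4)


NUM_EPOCHS = 150  # per the quoted setup for AUM, BMM, and GMM
```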