Interactive Label Cleaning with Example-based Explanations
Authors: Stefano Teso, Andrea Bontempelli, Fausto Giunchiglia, Andrea Passerini
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our extensive empirical evaluation shows that clarifying the reasons behind the model's suspicions by cleaning the counter-examples helps in acquiring substantially better data and models, especially when paired with our FIM approximation. We empirically address the following research questions: Q1: Do counter-examples contribute to cleaning the data? Q2: Which influence-based selection strategy identifies the most mislabeled counter-examples? Q3: What contributes to the effectiveness of the best counter-example selection strategy? |
| Researcher Affiliation | Academia | Stefano Teso University of Trento Trento, Italy stefano.teso@unitn.it Andrea Bontempelli University of Trento Trento, Italy andrea.bontempelli@unitn.it Fausto Giunchiglia University of Trento Trento, Italy fausto.giunchiglia@unitn.it Andrea Passerini University of Trento Trento, Italy andrea.passerini@unitn.it |
| Pseudocode | Yes | The pseudo-code of CINCER is listed in Algorithm 1. |
| Open Source Code | Yes | The code for all experiments is available at: https://github.com/abonte/cincer. |
| Open Datasets | Yes | Data sets. We used a diverse set of classification data sets: Adult [27]: data set of 48,800 persons... Breast [27]: data set of 569 patients... 20NG [27]: data set of newsgroup posts... MNIST [29]: handwritten digit recognition data set... Fashion [30]: fashion article classification dataset... |
| Dataset Splits | Yes | For adult and breast, a random 80 : 20 training-test split is used while for MNIST, fashion and 20NG the split provided with the data set is used. |
| Hardware Specification | Yes | All experiments were run on a 12-core machine with 16 GiB of RAM and no GPU. |
| Software Dependencies | No | We implemented CINCER using Python and TensorFlow [25] on top of three classifiers and compared different counter-example selection strategies on five data sets. (No library versions are specified.) |
| Experiment Setup | Yes | Upon receiving a new example, the classifier is retrained from scratch for 100 epochs using Adam [31] with default parameters, with early stopping when the accuracy on the training set reaches 90% for FC and CNN, and 70% for LR. The margin threshold is set to τ = 0.2. |