Interactive Label Cleaning with Example-based Explanations
Authors: Stefano Teso, Andrea Bontempelli, Fausto Giunchiglia, Andrea Passerini
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our extensive empirical evaluation shows that clarifying the reasons behind the model's suspicions by cleaning the counter-examples helps in acquiring substantially better data and models, especially when paired with our FIM approximation. We empirically address the following research questions: Q1: Do counter-examples contribute to cleaning the data? Q2: Which influence-based selection strategy identifies the most mislabeled counter-examples? Q3: What contributes to the effectiveness of the best counter-example selection strategy? |
| Researcher Affiliation | Academia | Stefano Teso University of Trento Trento, Italy stefano.teso@unitn.it Andrea Bontempelli University of Trento Trento, Italy andrea.bontempelli@unitn.it Fausto Giunchiglia University of Trento Trento, Italy fausto.giunchiglia@unitn.it Andrea Passerini University of Trento Trento, Italy andrea.passerini@unitn.it |
| Pseudocode | Yes | The pseudo-code of CINCER is listed in Algorithm 1. |
| Open Source Code | Yes | The code for all experiments is available at: https://github.com/abonte/cincer. |
| Open Datasets | Yes | Data sets. We used a diverse set of classification data sets: Adult [27]: data set of 48,800 persons... Breast [27]: data set of 569 patients... 20NG [27]: data set of newsgroup posts... MNIST [29]: handwritten digit recognition data set... Fashion [30]: fashion article classification dataset... |
| Dataset Splits | Yes | For adult and breast, a random 80 : 20 training-test split is used while for MNIST, fashion and 20NG the split provided with the data set is used. |
| Hardware Specification | Yes | All experiments were run on a 12-core machine with 16 GiB of RAM and no GPU. |
| Software Dependencies | No | We implemented CINCER using Python and TensorFlow [25] on top of three classifiers and compared different counter-example selection strategies on five data sets. (No library versions are specified.) |
| Experiment Setup | Yes | Upon receiving a new example, the classifier is retrained from scratch for 100 epochs using Adam [31] with default parameters, with early stopping when the accuracy on the training set reaches 90% for FC and CNN, and 70% for LR. The margin threshold is set to τ = 0.2. |