DOCTOR: A Simple Method for Detecting Misclassification Errors

Authors: Federica Granese, Marco Romanelli, Daniele Gorla, Catuscia Palamidessi, Pablo Piantanida

NeurIPS 2021

Reproducibility assessment. Each entry below gives the variable assessed, the result, and the LLM response quoting the paper's supporting evidence.

Research Type: Experimental
LLM Response: "Empirically, we show that DOCTOR outperforms all state-of-the-art methods on various well-known image and sentiment-analysis datasets. In this section we present a collection of experimental results to investigate the effectiveness of DOCTOR by applying it to several benchmark datasets."

Researcher Affiliation: Academia
LLM Response:
  Federica Granese (LIX, Inria, Institut Polytechnique de Paris; Sapienza University of Rome), federica.granese@inria.fr
  Marco Romanelli (L2S, CentraleSupélec, CNRS, Université Paris-Saclay), marco.romanelli@centralesupelec.fr
  Daniele Gorla (Sapienza University of Rome), gorla@di.uniroma1.it
  Catuscia Palamidessi (LIX, Inria, Institut Polytechnique de Paris), catuscia@lix.polytechnique.fr
  Pablo Piantanida (L2S, CentraleSupélec, CNRS, Université Paris-Saclay), pablo.piantanida@centralesupelec.fr

Pseudocode: No
LLM Response: "The paper defines mathematical expressions and discriminators but does not present them as structured pseudocode or algorithm blocks."

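Although the paper gives no algorithm blocks, its two discriminators are simple enough to reconstruct from their definitions: Dα compares the odds of ĝ(x) = 1 − Σ_y P(y|x)² against a threshold γ, and Dβ does the same with the estimated error probability P̂e(x) = 1 − max_y P(y|x). Below is a minimal NumPy sketch of that reconstruction; the function names, example inputs, and the γ value are our own illustration, not taken from the paper.

    import numpy as np

    def doctor_alpha(softmax_probs: np.ndarray, gamma: float) -> np.ndarray:
        """D_alpha: flag x when g_hat(x) / (1 - g_hat(x)) > gamma, where
        g_hat(x) = 1 - sum_y P(y|x)^2 estimates the probability that two
        labels sampled from the softmax disagree."""
        g_hat = 1.0 - np.sum(softmax_probs ** 2, axis=-1)
        return g_hat / (1.0 - g_hat) > gamma  # True = flagged as misclassified

    def doctor_beta(softmax_probs: np.ndarray, gamma: float) -> np.ndarray:
        """D_beta: flag x when Pe_hat(x) / (1 - Pe_hat(x)) > gamma, where
        Pe_hat(x) = 1 - max_y P(y|x) estimates the error probability of
        the MAP decision."""
        pe_hat = 1.0 - np.max(softmax_probs, axis=-1)
        return pe_hat / (1.0 - pe_hat) > gamma

    # Illustrative batch of two softmax outputs; gamma is an arbitrary example.
    probs = np.array([[0.90, 0.05, 0.05],
                      [0.40, 0.35, 0.25]])
    print(doctor_alpha(probs, gamma=0.5))  # [False  True]
    print(doctor_beta(probs, gamma=0.5))   # [False  True]

Note that both detectors consume only softmax outputs, which is why the paper can evaluate them in black-box settings where model internals are unavailable.
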
Open Source Code: Yes
LLM Response: "We provide publicly available code [1] to reproduce our results, and we give further details on the environment, the parameter setting and the experimental setup in the Supplementary Material (Appendix C). [1] https://github.com/doctor-public-submission/DOCTOR/"

Open Datasets: Yes
LLM Response: "Datasets and pre-trained networks. We run experiments on both image and textual datasets. We use CIFAR10 and CIFAR100 [18], Tiny ImageNet [16] and SVHN [27] as image datasets; IMDb [25], Amazon Fashion and Amazon Software [28] as textual datasets."

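All of the image datasets named above are publicly downloadable. As a convenience, here is a hedged sketch using standard torchvision loaders; the root paths, normalization constants, and choice of test splits are our own illustration, not taken from the paper.

    import torchvision
    import torchvision.transforms as T

    # Standard per-channel CIFAR-10 statistics; not taken from the paper.
    transform = T.Compose([
        T.ToTensor(),
        T.Normalize((0.4914, 0.4822, 0.4465), (0.2470, 0.2435, 0.2616)),
    ])

    cifar10 = torchvision.datasets.CIFAR10(root="./data", train=False,
                                           download=True, transform=transform)
    cifar100 = torchvision.datasets.CIFAR100(root="./data", train=False,
                                             download=True, transform=transform)
    svhn = torchvision.datasets.SVHN(root="./data", split="test",
                                     download=True, transform=transform)
    # Tiny ImageNet has no built-in torchvision loader; it ships as an image
    # folder (e.g. http://cs231n.stanford.edu/tiny-imagenet-200.zip) and can
    # be read with torchvision.datasets.ImageFolder after extraction.
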
Dataset Splits: No
LLM Response: "According to our framework, no validation samples are available; consequently, in order to be consistent across the datasets, we only report the experimental settings and values for which, on average, we obtain favorable results for all the considered domains (cf. Figure 2)."

Hardware Specification: No
LLM Response: "The paper does not provide specific details about the hardware (e.g., GPU/CPU models, memory) used to run the experiments."

Software Dependencies: No
LLM Response: "The paper mentions providing details on the 'environment' in the Supplementary Material, but the main text does not list specific software dependencies with version numbers."

Experiment Setup: Yes
LLM Response: "We keep the same parameter setting for all the methods. In the case of DOCTOR and ODIN, where temperature scaling is allowed, we test, for each dataset, 24 different values of ϵ for each of the 11 different values of T; see Appendix C.4.2 for the set of ranges. Table 1: for all methods, in TBB we set T = 1 and ϵ = 0; in PBB we set ϵα = ϵβ = 0.00035, Tα = 1, Tβ = 1.5, ϵODIN = 0 and TODIN = 1.3, ϵMHLNB = 0.0002 and TMHLNB = 1, ϵENERGY = 0 and TENERGY = 1."

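The (T, ϵ) grid described above follows the familiar recipe of temperature scaling plus a small input perturbation popularized by ODIN; DOCTOR's perturbed variant applies an analogous perturbation with respect to its own score. Below is a minimal PyTorch sketch of a single (T, ϵ) evaluation, shown in the ODIN form for concreteness; the helper name and the grid values in the trailing comment are illustrative, and the actual ranges are listed in Appendix C.4.2 of the paper.

    import torch
    import torch.nn.functional as F

    def perturbed_temperature_scores(model, x, T: float, eps: float):
        """Temperature-scaled softmax probabilities after an ODIN-style input
        perturbation of magnitude eps (eps = 0 disables the perturbation).
        Assumes `model` returns logits and is in eval mode."""
        x = x.detach().clone().requires_grad_(True)
        logits = model(x)
        if eps > 0:
            # Nudge the input to increase the max softmax score, as in ODIN.
            nll = -F.log_softmax(logits / T, dim=-1).max(dim=-1).values.sum()
            nll.backward()
            x = x - eps * x.grad.sign()
            logits = model(x)
        return F.softmax(logits / T, dim=-1).detach()

    # Hypothetical grid mirroring the search described above (values are
    # illustrative; the real ranges are in Appendix C.4.2 of the paper):
    # for T in (1.0, 1.1, ..., 2.0):        # 11 temperatures
    #     for eps in (0.0, 5e-5, ...):      # 24 perturbation magnitudes
    #         probs = perturbed_temperature_scores(model, x, T, eps)
    #         flags = doctor_alpha(probs.numpy(), gamma)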