DOCTOR: A Simple Method for Detecting Misclassification Errors
Authors: Federica Granese, Marco Romanelli, Daniele Gorla, Catuscia Palamidessi, Pablo Piantanida
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, we show that DOCTOR outperforms all state-of-the-art methods on various well-known images and sentiment analysis datasets. In this section we present a collection of experimental results to investigate the effectiveness of DOCTOR, by applying it to several benchmark datasets. |
| Researcher Affiliation | Academia | Federica Granese: Lix, Inria, Institute Polytechnique de Paris, Sapienza University of Rome, federica.granese@inria.fr; Marco Romanelli: L2S, Centrale Supélec, CNRS, Université Paris Saclay, marco.romanelli@centralesupelec.fr; Daniele Gorla: Sapienza University of Rome, gorla@di.uniroma1.it; Catuscia Palamidessi: Lix, Inria, Institute Polytechnique de Paris, catuscia@lix.polytechnique.fr; Pablo Piantanida: L2S, Centrale Supélec, CNRS, Université Paris Saclay, pablo.piantanida@centralesupelec.fr |
| Pseudocode | No | The paper defines mathematical expressions and discriminators but does not present them as structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | We provide publicly available code to reproduce our results, and we give further details on the environment, the parameter setting and the experimental setup in the Supplementary material (Appendix C). Footnote 1: https://github.com/doctor-public-submission/DOCTOR/ |
| Open Datasets | Yes | Datasets and pre-trained networks. We run experiments on both image and textual datasets. We use CIFAR10 and CIFAR100 [18], Tiny ImageNet [16] and SVHN [27] as image datasets; IMDb [25], Amazon Fashion and Amazon Software [28] as textual datasets. |
| Dataset Splits | No | According to our framework, no validation samples are available; consequently, in order to be consistent across the datasets, we only report the experimental settings and values for which, on average, we obtain favorable results for all the considered domains (cf. Figure 2). |
| Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU/CPU models, memory) used to run the experiments. |
| Software Dependencies | No | The paper mentions providing details on the 'environment' in the Supplementary material, but the main text does not list specific software dependencies with version numbers. |
| Experiment Setup | Yes | We keep the same parameter setting for all the methods. In the case of DOCTOR and ODIN where temperature scaling is allowed, we test, for each dataset, 24 different values of ϵ for each of the 11 different values of T; see Appendix C.4.2 for the set of ranges. Table 1: For all methods, in TBB, we set T = 1 and ϵ = 0; in PBB we set ϵ_α = ϵ_β = 0.00035, T_α = 1, T_β = 1.5, ϵ_ODIN = 0 and T_ODIN = 1.3, ϵ_MHLNB = 0.0002 and T_MHLNB = 1, ϵ_ENERGY = 0 and T_ENERGY = 1. |
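
The settings quoted in the Experiment Setup row can be collected into a small configuration, which may make the TBB/PBB distinction easier to see. The sketch below is illustrative only and is not taken from the authors' repository: the dictionary keys and function names are hypothetical, the actual grid values (listed in Appendix C.4.2 of the paper) are not reproduced, and the two score functions assume the usual form of the DOCTOR ratios, namely (1 - g) / g with g the sum of squared softmax probabilities for the α variant, and Pe / (1 - Pe) with Pe = 1 - max probability for the β variant. The input perturbation controlled by ϵ needs model gradients and is left out.

```python
import math

# Settings quoted in the Experiment Setup row (Table 1 of the paper).
# Key names are hypothetical labels chosen for this sketch.
TBB = {"T": 1.0, "eps": 0.0}  # same values for every method
PBB = {
    "DOCTOR_alpha": {"T": 1.0, "eps": 0.00035},
    "DOCTOR_beta": {"T": 1.5, "eps": 0.00035},
    "ODIN": {"T": 1.3, "eps": 0.0},
    "MHLNB": {"T": 1.0, "eps": 0.0002},
    "ENERGY": {"T": 1.0, "eps": 0.0},
}

# Size of the per-dataset search: 24 eps values for each of 11 T values.
N_EPS, N_T = 24, 11
GRID_SIZE = N_EPS * N_T


def softmax(logits, T=1.0):
    """Temperature-scaled softmax over a list of logits."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp((z - m) / T) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def doctor_alpha(probs):
    """Assumed alpha score: (1 - g) / g, with g the sum of squared probabilities."""
    g = sum(p * p for p in probs)
    return (1.0 - g) / g


def doctor_beta(probs):
    """Assumed beta score: Pe / (1 - Pe), with Pe = 1 - max probability."""
    pe = 1.0 - max(probs)
    return pe / (1.0 - pe)


if __name__ == "__main__":
    logits = [2.0, 0.5, -1.0]  # made-up logits, for illustration only
    for name, cfg in (("DOCTOR_alpha", doctor_alpha), ("DOCTOR_beta", doctor_beta)):
        probs = softmax(logits, T=PBB[name]["T"])
        print(name, cfg(probs))
```

Under these assumptions, a sample would be flagged as a likely misclassification when its score exceeds a chosen threshold; larger temperatures flatten the softmax output and therefore raise both scores.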