DOCTOR: A Simple Method for Detecting Misclassification Errors

Authors: Federica Granese, Marco Romanelli, Daniele Gorla, Catuscia Palamidessi, Pablo Piantanida

NeurIPS 2021

Reproducibility assessment. Each entry below gives the variable assessed, the result, and the LLM response quoting the paper's supporting evidence.

Research Type: Experimental
LLM Response: "Empirically, we show that DOCTOR outperforms all state-of-the-art methods on various well-known image and sentiment-analysis datasets. In this section we present a collection of experimental results to investigate the effectiveness of DOCTOR by applying it to several benchmark datasets."

Researcher Affiliation: Academia
LLM Response:
  Federica Granese (LIX, Inria, Institut Polytechnique de Paris; Sapienza University of Rome), federica.granese@inria.fr
  Marco Romanelli (L2S, CentraleSupélec, CNRS, Université Paris-Saclay), marco.romanelli@centralesupelec.fr
  Daniele Gorla (Sapienza University of Rome), gorla@di.uniroma1.it
  Catuscia Palamidessi (LIX, Inria, Institut Polytechnique de Paris), catuscia@lix.polytechnique.fr
  Pablo Piantanida (L2S, CentraleSupélec, CNRS, Université Paris-Saclay), pablo.piantanida@centralesupelec.fr

Pseudocode: No
LLM Response: "The paper defines mathematical expressions and discriminators but does not present them as structured pseudocode or algorithm blocks."

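Although the paper gives no algorithm blocks, its two discriminators are simple enough to reconstruct from their definitions: Dα compares the odds of ĝ(x) = 1 − Σ_y P(y|x)² against a threshold γ, and Dβ does the same with the estimated error probability P̂e(x) = 1 − max_y P(y|x). Below is a minimal NumPy sketch of that reconstruction; the function names, example inputs, and the γ value are our own illustration, not taken from the paper.

    import numpy as np

    def doctor_alpha(softmax_probs: np.ndarray, gamma: float) -> np.ndarray:
        """D_alpha: flag x when g_hat(x) / (1 - g_hat(x)) > gamma, where
        g_hat(x) = 1 - sum_y P(y|x)^2 estimates the probability that two
        labels sampled from the softmax disagree."""
        g_hat = 1.0 - np.sum(softmax_probs ** 2, axis=-1)
        return g_hat / (1.0 - g_hat) > gamma  # True = flagged as misclassified

    def doctor_beta(softmax_probs: np.ndarray, gamma: float) -> np.ndarray:
        """D_beta: flag x when Pe_hat(x) / (1 - Pe_hat(x)) > gamma, where
        Pe_hat(x) = 1 - max_y P(y|x) estimates the error probability of
        the MAP decision."""
        pe_hat = 1.0 - np.max(softmax_probs, axis=-1)
        return pe_hat / (1.0 - pe_hat) > gamma

    # Illustrative batch of two softmax outputs; gamma is an arbitrary example.
    probs = np.array([[0.90, 0.05, 0.05],
                      [0.40, 0.35, 0.25]])
    print(doctor_alpha(probs, gamma=0.5))  # [False  True]
    print(doctor_beta(probs, gamma=0.5))   # [False  True]

Note that both detectors consume only softmax outputs, which is why the paper can evaluate them in black-box settings where model internals are unavailable.
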
Open Source Code: Yes
LLM Response: "We provide publicly available code [1] to reproduce our results, and we give further details on the environment, the parameter setting and the experimental setup in the Supplementary Material (Appendix C). [1] https://github.com/doctor-public-submission/DOCTOR/"

Open Datasets: Yes
LLM Response: "Datasets and pre-trained networks. We run experiments on both image and textual datasets. We use CIFAR10 and CIFAR100 [18], Tiny ImageNet [16] and SVHN [27] as image datasets; IMDb [25], Amazon Fashion and Amazon Software [28] as textual datasets."

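All of the image datasets named above are publicly downloadable. As a convenience, here is a hedged sketch using standard torchvision loaders; the root paths, normalization constants, and choice of test splits are our own illustration, not taken from the paper.

    import torchvision
    import torchvision.transforms as T

    # Standard per-channel CIFAR-10 statistics; not taken from the paper.
    transform = T.Compose([
        T.ToTensor(),
        T.Normalize((0.4914, 0.4822, 0.4465), (0.2470, 0.2435, 0.2616)),
    ])

    cifar10 = torchvision.datasets.CIFAR10(root="./data", train=False,
                                           download=True, transform=transform)
    cifar100 = torchvision.datasets.CIFAR100(root="./data", train=False,
                                             download=True, transform=transform)
    svhn = torchvision.datasets.SVHN(root="./data", split="test",
                                     download=True, transform=transform)
    # Tiny ImageNet has no built-in torchvision loader; it ships as an image
    # folder (e.g. http://cs231n.stanford.edu/tiny-imagenet-200.zip) and can
    # be read with torchvision.datasets.ImageFolder after extraction.
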
Dataset Splits: No
LLM Response: "According to our framework, no validation samples are available; consequently, in order to be consistent across the datasets, we only report the experimental settings and values for which, on average, we obtain favorable results for all the considered domains (cf. Figure 2)."

Hardware Specification: No
LLM Response: "The paper does not provide specific details about the hardware (e.g., GPU/CPU models, memory) used to run the experiments."

Software Dependencies: No
LLM Response: "The paper mentions providing details on the 'environment' in the Supplementary Material, but the main text does not list specific software dependencies with version numbers."

Experiment Setup: Yes
LLM Response: "We keep the same parameter setting for all the methods. In the case of DOCTOR and ODIN, where temperature scaling is allowed, we test, for each dataset, 24 different values of ϵ for each of the 11 different values of T; see Appendix C.4.2 for the set of ranges. Table 1: for all methods, in TBB we set T = 1 and ϵ = 0; in PBB we set ϵα = ϵβ = 0.00035, Tα = 1, Tβ = 1.5, ϵODIN = 0 and TODIN = 1.3, ϵMHLNB = 0.0002 and TMHLNB = 1, ϵENERGY = 0 and TENERGY = 1."

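The (T, ϵ) grid described above follows the familiar recipe of temperature scaling plus a small input perturbation popularized by ODIN; DOCTOR's perturbed variant applies an analogous perturbation with respect to its own score. Below is a minimal PyTorch sketch of a single (T, ϵ) evaluation, shown in the ODIN form for concreteness; the helper name and the grid values in the trailing comment are illustrative, and the actual ranges are listed in Appendix C.4.2 of the paper.

    import torch
    import torch.nn.functional as F

    def perturbed_temperature_scores(model, x, T: float, eps: float):
        """Temperature-scaled softmax probabilities after an ODIN-style input
        perturbation of magnitude eps (eps = 0 disables the perturbation).
        Assumes `model` returns logits and is in eval mode."""
        x = x.detach().clone().requires_grad_(True)
        logits = model(x)
        if eps > 0:
            # Nudge the input to increase the max softmax score, as in ODIN.
            nll = -F.log_softmax(logits / T, dim=-1).max(dim=-1).values.sum()
            nll.backward()
            x = x - eps * x.grad.sign()
            logits = model(x)
        return F.softmax(logits / T, dim=-1).detach()

    # Hypothetical grid mirroring the search described above (values are
    # illustrative; the real ranges are in Appendix C.4.2 of the paper):
    # for T in (1.0, 1.1, ..., 2.0):        # 11 temperatures
    #     for eps in (0.0, 5e-5, ...):      # 24 perturbation magnitudes
    #         probs = perturbed_temperature_scores(model, x, T, eps)
    #         flags = doctor_alpha(probs.numpy(), gamma)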