To Trust Or Not To Trust A Classifier

Authors: Heinrich Jiang, Been Kim, Melody Guan, Maya Gupta

NeurIPS 2018

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this section, we empirically test whether trust scores can both detect examples that are incorrectly classified with high precision and be used as a signal to determine which examples are likely correctly classified. We perform this evaluation across (i) different datasets (Sections 5.1 and 5.3), (ii) different families of classifiers (neural network, random forest, and logistic regression) (Section 5.1), (iii) classifiers with varying accuracy on the same task (Section 5.2), and (iv) different representations of the data, e.g., the input data or the activations of various intermediate layers of a neural network (Section 5.3).
Researcher Affiliation | Collaboration | Heinrich Jiang (Google Research, heinrichj@google.com); Been Kim (Google Brain, beenkim@google.com); Melody Y. Guan (Stanford University, mguan@stanford.edu); Maya Gupta (Google Research, mayagupta@google.com)
Pseudocode | Yes | Algorithm 1 (Estimating the α-high-density set). Parameters: α (density threshold), k. Input: sample points X := {x₁, ..., xₙ} drawn from f. Define the k-NN radius r_k(x) := inf{r > 0 : |B(x, r) ∩ X| ≥ k} and let ε := inf{r > 0 : |{x ∈ X : r_k(x) > r}| ≤ α·n}. Return Ĥ_α(f) := {x ∈ X : r_k(x) ≤ ε}. Algorithm 2 (Trust Score). Parameters: α (density threshold), k. Input: classifier h : X → Y, training data (x₁, y₁), ..., (xₙ, yₙ), test example x. For each ℓ ∈ Y, let Ĥ_α(f_ℓ) be the output of Algorithm 1 with parameters α, k on the sample points {x_j : 1 ≤ j ≤ n, y_j = ℓ}. Return the trust score ξ(h, x) := d(x, Ĥ_α(f_{h̃(x)})) / d(x, Ĥ_α(f_{h(x)})), where h̃(x) := argmin_{ℓ ∈ Y, ℓ ≠ h(x)} d(x, Ĥ_α(f_ℓ)).
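As a concrete reading of the two algorithms, here is a minimal Python sketch, assuming Euclidean distance and scikit-learn's NearestNeighbors; the function names and structure are my own, not the authors' reference implementation.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def high_density_set(X, alpha, k=10):
    """Algorithm 1: keep the (1 - alpha) fraction of sample points with the
    smallest k-NN radius, an estimate of the alpha-high-density set."""
    nn = NearestNeighbors(n_neighbors=k).fit(X)
    # Distance to each point's k-th nearest neighbor (each point is its own
    # 1st neighbor here, a boundary detail a real implementation would
    # handle explicitly).
    radii = nn.kneighbors(X)[0][:, -1]
    eps = np.percentile(radii, 100 * (1 - alpha))  # cuts off ~alpha * n points
    return X[radii <= eps]

def trust_score(x, y_pred, density_sets, tiny=1e-12):
    """Algorithm 2: distance from x to the nearest *other* class's
    high-density set, divided by the distance to the predicted class's."""
    dist = {l: np.linalg.norm(H - x, axis=1).min()
            for l, H in density_sets.items()}
    d_other = min(d for l, d in dist.items() if l != y_pred)
    return d_other / (dist[y_pred] + tiny)  # tiny guards against division by zero
```

The per-class density sets would be built once from the training data, e.g. `density_sets = {l: high_density_set(X_train[y_train == l], alpha) for l in np.unique(y_train)}`; a large trust score then indicates the test point lies much closer to its predicted class's high-density region than to any other class's.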
Open Source Code | Yes | An open-source implementation of Trust Scores can be found here: https://github.com/google/TrustScore
Open Datasets | Yes | The MNIST handwritten digit dataset [48] consists of 60,000 28×28-pixel training images and 10,000 testing images in 10 classes. The SVHN dataset [49] consists of 73,257 32×32-pixel colour training images and 26,032 testing images, and also has 10 classes. The CIFAR-10 and CIFAR-100 datasets [50] both consist of 60,000 32×32-pixel colour images, with 50,000 training images and 10,000 test images.
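For reference, the MNIST and CIFAR splits quoted above match the ones Keras ships; a sketch of loading them follows (using the modern tensorflow.keras namespace rather than the 2015 Keras the paper cites). SVHN is not bundled with Keras, and obtaining it from the canonical source at http://ufldl.stanford.edu/housenumbers/ is my assumption, not something the paper states.

```python
from tensorflow.keras import datasets

(x_tr, y_tr), (x_te, y_te) = datasets.mnist.load_data()     # 60,000 / 10,000, 28x28 grayscale
(x_tr, y_tr), (x_te, y_te) = datasets.cifar10.load_data()   # 50,000 / 10,000, 32x32 colour
(x_tr, y_tr), (x_te, y_te) = datasets.cifar100.load_data()  # 50,000 / 10,000, 32x32 colour
```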
Dataset Splits | Yes | For each run we took a random stratified split of the dataset into two halves: one portion was used for training the trust score and the other for evaluation. The standard error is shown in addition to the average precision across the runs at each percentile level. ... The MNIST handwritten digit dataset [48] consists of 60,000 28×28-pixel training images and 10,000 testing images in 10 classes. The SVHN dataset [49] consists of 73,257 32×32-pixel colour training images and 26,032 testing images, and also has 10 classes. The CIFAR-10 and CIFAR-100 datasets [50] both consist of 60,000 32×32-pixel colour images, with 50,000 training images and 10,000 test images.
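The evaluation protocol described here is a plain stratified half/half split repeated over several runs; a short sketch with scikit-learn (the variable names, loop count, and use of train_test_split are my choices, not the paper's code):

```python
from sklearn.model_selection import train_test_split

# X, y: full dataset features and labels for one task.
for run_seed in range(10):  # number of repeated runs is illustrative
    # Half the data fits the trust score; the other half is scored.
    X_trust, X_eval, y_trust, y_eval = train_test_split(
        X, y, test_size=0.5, stratify=y, random_state=run_seed)
```

Averaging precision over such runs at each percentile level, with standard errors, yields the curves the paper reports.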
Hardware Specification | No | The paper does not provide hardware details such as GPU/CPU models or memory used to run the experiments; it only mentions using pretrained models and networks implemented in Keras.
Software Dependencies | No | The paper mentions Keras [53] as the framework used (François Chollet et al., Keras, https://github.com/fchollet/keras, 2015) but does not give a version number for Keras or any other software dependency.
Experiment Setup | Yes | Throughout our experiments, we fix k = 10, and use cross-validation to select α as it is data-dependent. ... The CIFAR-10 VGG-16 network achieves a test accuracy of 93.56%, while the CIFAR-100 network achieves a test accuracy of 70.48%. We used pretrained, smaller CNNs for MNIST and SVHN. The MNIST network achieves a test accuracy of 99.07% and the SVHN network achieves a test accuracy of 95.45%. All architectures were implemented in Keras [53]. ... As input to the trust score, we tried using 1) the logit layer, 2) the preceding fully connected layer with ReLU activation, 3) this fully connected layer, which has 128 dimensions in the MNIST network and 512 dimensions in the other networks, reduced down to 20 dimensions by applying PCA. ... All plots were made using α = 0; using cross-validation to select a different α did not improve trust score performance.
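Putting the setup together, here is a hedged sketch of one configuration: trust scores over a network's fully connected layer activations reduced to 20 dimensions with PCA (get_activations is a hypothetical helper, and high_density_set / trust_score refer to the sketch after the pseudocode row above, not the authors' code).

```python
import numpy as np
from sklearn.decomposition import PCA

k, alpha = 10, 0.0  # fixed k = 10; alpha = 0 keeps every training point
acts_train = get_activations(model, X_train)  # hypothetical: layer activations
acts_test = get_activations(model, X_test)

pca = PCA(n_components=20).fit(acts_train)    # 128- or 512-dim layer -> 20 dims
Z_train, Z_test = pca.transform(acts_train), pca.transform(acts_test)

density_sets = {l: high_density_set(Z_train[y_train == l], alpha, k)
                for l in np.unique(y_train)}
scores = [trust_score(z, yp, density_sets)
          for z, yp in zip(Z_test, y_pred_test)]
```

With α = 0 the estimated high-density set is simply the full per-class training sample, consistent with the paper's note that cross-validating α did not improve trust score performance.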