When Does Confidence-Based Cascade Deferral Suffice?

Authors: Wittawat Jitkrittum, Neha Gupta, Aditya K. Menon, Harikrishna Narasimhan, Ankit Rawat, Sanjiv Kumar

NeurIPS 2023

Reproducibility variables, each with its result and the LLM response supporting it:
Research Type: Experimental
LLM Response: In this section, we provide empirical evidence to support our analysis in Section 4.1 by considering the three failure modes in which confidence-based deferral underperforms. For each of these settings, we compute deferral curves that plot classification accuracy against the fraction of samples deferred to the second model (which implicitly measures the overall compute cost). In line with our analysis in Section 4.1, post-hoc deferral rules offer better accuracy-cost trade-offs in these settings.
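For concreteness, a deferral curve of this kind can be computed by sweeping the confidence threshold of model 1 and recording the resulting deferral rate and end-to-end accuracy. The following is a minimal sketch under assumed inputs (precomputed confidences and per-example correctness arrays); it is not code from the paper:

```python
import numpy as np

def deferral_curve(conf1, correct1, correct2, num_points=101):
    """Accuracy vs. deferral rate for a two-model confidence cascade.

    conf1:    (n,) model-1 confidence scores, e.g. max softmax probability.
    correct1: (n,) bool, whether model 1's prediction is correct.
    correct2: (n,) bool, whether model 2's prediction is correct.
    """
    rates, accs = [], []
    # Sweep the confidence threshold; samples below it are deferred to model 2.
    for t in np.quantile(conf1, np.linspace(0.0, 1.0, num_points)):
        defer = conf1 < t
        acc = np.where(defer, correct2, correct1).mean()
        rates.append(defer.mean())
        accs.append(acc)
    return np.array(rates), np.array(accs)
```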
Researcher Affiliation: Industry
LLM Response: Wittawat Jitkrittum, Neha Gupta, Aditya Krishna Menon, Harikrishna Narasimhan, Ankit Singh Rawat, Sanjiv Kumar. Google Research, New York. {wittawat, nehagup, adityakmenon, hnarasimhan, ankitsrawat, sanjivk}@google.com
Pseudocode: Yes
LLM Response: Algorithm 1 Confidence-based cascades of K classifiers
Input: K ≥ 2 classifiers p^(1), ..., p^(K): X → Δ_L; thresholds c^(1), ..., c^(K-1) ∈ [0, 1]
Input: An input instance x ∈ X
1: for k = 1, 2, ..., K - 1 do
2:   if max_y p^(k)_y(x) > c^(k) then
3:     Predict class ŷ = argmax_y p^(k)_y(x)
4:     break
5:   end if
6: end for
7: Predict class ŷ = argmax_y p^(K)_y(x)
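The algorithm translates directly into code. Below is a minimal runnable Python sketch of this cascade; the function name, the numpy representation, and the calling convention are illustrative assumptions, not from the paper:

```python
import numpy as np

def cascade_predict(x, models, thresholds):
    """Confidence-based cascade over K probabilistic classifiers.

    models:     list of K callables; each maps an input x to a numpy
                probability vector over the L classes.
    thresholds: list of K - 1 confidence thresholds in [0, 1].
    """
    for model, c in zip(models[:-1], thresholds):
        probs = model(x)
        # Use the earliest model whose top-class confidence clears its threshold.
        if probs.max() > c:
            return int(probs.argmax())
    # Otherwise fall back to the final (largest) model.
    return int(models[-1](x).argmax())
```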
Open Source Code: No
LLM Response: The paper does not contain any explicit statement about open-sourcing the code for the methodology or provide a link to a code repository.
Open Datasets: Yes
LLM Response: We use MobileNet V2 [65] as h^(1), and a larger EfficientNet B0 [71] as h^(2). For hyperparameter details, see Appendix C. [...] We consider the CIFAR-100 dataset where training examples from pre-chosen L_noise ∈ {0, 10, 25} classes are assigned a uniformly drawn label. [...] To simulate distribution shift, we consider a long-tailed version of CIFAR-100 [40] where there are h ∈ {100, 50, 25} head classes, and 100 - h tail classes. Each head class has 500 training images, and each tail class has 50 training images.
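The long-tailed construction described above is straightforward to reproduce by subsampling. A minimal sketch, assuming integer class labels and that the first num_head class indices are the head classes (which head classes the paper picks is not specified here):

```python
import numpy as np

def make_long_tailed(labels, num_head, head_size=500, tail_size=50, seed=0):
    """Subsample indices for a long-tailed split: the first `num_head`
    classes keep `head_size` examples each, the rest keep `tail_size`.

    labels: (n,) integer class labels for the full training set.
    """
    rng = np.random.default_rng(seed)
    keep = []
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        size = head_size if c < num_head else tail_size
        keep.append(rng.choice(idx, size=size, replace=False))
    return np.concatenate(keep)
```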
Dataset Splits: Yes
LLM Response: We train g on a held-out validation set. For full technical details of the post-hoc model architecture and training, see Appendix C. [...] Figure 1 shows the accuracy of confidence-based deferral (Confidence) and Relative Confidence (Equation (6)) on the standard ImageNet test set as a function of the deferral rate. [...] We consider the CIFAR-100 dataset where training examples from pre-chosen L_noise ∈ {0, 10, 25} classes are assigned a uniformly drawn label. The case of L_noise = 0 corresponds to the standard CIFAR-100 problem.
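Similarly, the label-noise setup can be simulated as follows. This sketch assumes the noisy classes are the first num_noisy_classes class indices; the paper only says they are pre-chosen:

```python
import numpy as np

def add_label_noise(labels, num_noisy_classes, num_classes=100, seed=0):
    """Relabel examples from the first `num_noisy_classes` classes with
    uniformly drawn labels, mimicking the paper's label-noise setup.
    """
    rng = np.random.default_rng(seed)
    labels = labels.copy()
    noisy = labels < num_noisy_classes  # stand-in for the pre-chosen classes
    labels[noisy] = rng.integers(0, num_classes, size=noisy.sum())
    return labels
```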
Hardware Specification: No
LLM Response: The paper mentions using specific models like MobileNet V2 and EfficientNet B0, but does not provide any details about the hardware (e.g., CPU, GPU, memory) used to run the experiments or train these models.
Software Dependencies: No
LLM Response: The paper mentions a 'Multi-Layer Perceptron (MLP)' and the 'Adam' optimizer, but does not provide specific version numbers for any software libraries (e.g., TensorFlow, PyTorch, scikit-learn) or programming languages used.
Experiment Setup: Yes
LLM Response: For full technical details of the post-hoc model architecture and training, see Appendix C. We use the objectives described in Table 1 to train g. [...] For a post-hoc approach to be practical, the overhead from invoking a post-hoc model must be small relative to the costs of h^(1) and h^(2). To this end, in all of the following experiments, the post-hoc model g: X → ℝ is based on a lightweight, three-layer Multi-Layer Perceptron (MLP) that takes as input the probability outputs from model 1. That is, g(x) = MLP(p^(1)(x)), where p^(1)(x) ∈ Δ_L denotes all probability outputs from model 1. Learning g amounts to learning the MLP, as the two base models are fixed. We train g on a held-out validation set.
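To make the post-hoc setup concrete, here is a minimal sketch of such a three-layer MLP deferral model. PyTorch is an assumed framework (the paper does not name one), and the hidden width is an illustrative choice:

```python
import torch
import torch.nn as nn

class PostHocDeferral(nn.Module):
    """Three-layer MLP g(x) = MLP(p1(x)) producing a deferral score.

    Takes model 1's probability vector over L classes and outputs a
    single score indicating whether to defer to model 2.
    """
    def __init__(self, num_classes, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(num_classes, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, p1):
        # p1: (batch, num_classes) probability outputs from model 1.
        return self.net(p1).squeeze(-1)

# Only g is trained (e.g. with Adam on a held-out validation set);
# the two base models h1 and h2 stay fixed.
```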