Active Assessment of Prediction Services as Accuracy Surface Over Attribute Combinations

Authors: Vihari Piratla, Soumen Chakrabarti, Sunita Sarawagi

NeurIPS 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We report on extensive experiments using several real data sets. Comparison with several estimators based on Bernoulli arm parameters, Beta densities per arm, and even simpler forms of GPs on the arm Beta distributions shows that AAA is superior at quickly cutting down arm accuracy uncertainty. (See the per-arm Beta estimator sketch after this table.)
Researcher Affiliation | Academia | Vihari Piratla, Soumen Chakrabarti, Sunita Sarawagi, Department of Computer Science, Indian Institute of Technology Bombay
Pseudocode | Yes | Detailed pseudo-code of AAA is given in Appendix I.
Open Source Code | Yes | Our code and dataset can be found at: https://github.com/vihari/AAA.
Open Datasets | Yes | We experiment with two real data sets and tasks. Our two tasks are male-female gender classification with two classes and animal classification with 10 classes. Male-Female classification (MF): CelebA [17] is a popular celebrity faces and attribute data set... Animal classification (AC): COCO-Stuff [18] provides an image collection.
Dataset Splits | Yes | Warm start: We start with 500 examples having gold attributes+labels to warm start all our experiments. The random seed also picks this random subset of 500 labeled examples. (See the warm-start split sketch after this table.)
Hardware Specification | No | The paper does not specify the hardware used for its experiments, such as particular GPU or CPU models.
Software Dependencies | No | The paper mentions 'GPyTorch [10]' as a library but does not provide a specific version number. Other mentions, such as the 'ResNet-50 model', refer to architectures rather than specific software versions.
Experiment Setup | Yes | Warm start: We start with 500 examples having gold attributes+labels to warm start all our experiments. The random seed also picks this random subset of 500 labeled examples. We calculate the overall accuracy of the classifier on these warm-start examples as ρ̂ = (1/|D|) Σ_i 1[ŷ_i = y_i]. For all arms we warm start their observation with c_a = λρ̂, n_a = λ, where λ = 0.1, a randomly picked low value. All the numbers reported here are averaged over three runs, each with a different random seed. The initial set of warm-start examples (D) is also changed between the runs. In the case of BetaGP-SLP, for any arm with observation count below 5, we mean pool from its three closest neighbours. (See the warm-start initialization sketch after this table.)
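
The Research Type row mentions baselines that keep a Beta density per arm (attribute combination). The following is a minimal sketch, not taken from the AAA repository, of such a per-arm Beta-Bernoulli accuracy estimator; the class and method names are illustrative only. It shows the quantity the paper calls "arm accuracy uncertainty": the posterior variance of each arm's accuracy.

```python
# Minimal sketch (assumed, not the authors' code) of a per-arm Beta-Bernoulli
# accuracy estimator: each arm keeps its own Beta posterior over accuracy.
from dataclasses import dataclass


@dataclass
class BetaArm:
    alpha: float = 1.0  # pseudo-count of correct predictions
    beta: float = 1.0   # pseudo-count of incorrect predictions

    def update(self, correct: bool) -> None:
        # One labeled example observed for this arm.
        if correct:
            self.alpha += 1.0
        else:
            self.beta += 1.0

    def mean(self) -> float:
        # Posterior mean accuracy of the arm.
        return self.alpha / (self.alpha + self.beta)

    def variance(self) -> float:
        # Posterior variance: the "arm accuracy uncertainty" that an
        # active assessor tries to reduce quickly.
        s = self.alpha + self.beta
        return self.alpha * self.beta / (s * s * (s + 1.0))


# Example: an arm observed on four correct and one incorrect prediction.
arm = BetaArm()
for correct in [True, True, False, True, True]:
    arm.update(correct)
print(arm.mean(), arm.variance())
```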
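
The Dataset Splits row states that the random seed also selects the 500 gold-labeled warm-start examples. Below is a small sketch, under the assumption that a single seeded permutation drives the selection; the function name and NumPy-based implementation are mine, not the authors'.

```python
# Sketch of the seed-driven warm-start split described above (assumed
# implementation): the seed picks the random subset of 500 labeled examples.
import numpy as np


def warm_start_split(num_examples: int, seed: int, warm_size: int = 500):
    """Return indices of warm-start examples and of the remaining pool."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(num_examples)
    warm_idx = perm[:warm_size]   # 500 examples with gold attributes + labels
    pool_idx = perm[warm_size:]   # remaining pool explored during assessment
    return warm_idx, pool_idx


# Three runs, each with a different seed (and hence a different warm-start set D).
for seed in (0, 1, 2):
    warm_idx, pool_idx = warm_start_split(num_examples=10_000, seed=seed)
    print(seed, len(warm_idx), len(pool_idx))
```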
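
The Experiment Setup row describes initializing every arm with pseudo-observations c_a = λρ̂ and n_a = λ (λ = 0.1), and, for BetaGP-SLP, mean pooling from the three closest arms when an arm has fewer than 5 observations. The sketch below illustrates those two steps under assumptions: the data structures, neighbour lists, and function names are hypothetical and not the paper's implementation.

```python
# Hypothetical sketch of the warm-start arm initialization and the
# nearest-neighbour mean pooling for sparsely observed arms.
import numpy as np

LAMBDA = 0.1  # small pseudo-count, per the paper's setup
MIN_OBS = 5   # pooling threshold for under-observed arms


def init_arms(arm_keys, warm_correct, warm_total):
    """Initialize per-arm pseudo-observations (c_a, n_a) from overall accuracy."""
    rho_hat = warm_correct / warm_total  # overall accuracy on warm-start set D
    return {a: {"c": LAMBDA * rho_hat, "n": LAMBDA} for a in arm_keys}


def pooled_accuracy(arm, arms, neighbours):
    """Mean-pool accuracy from the 3 closest arms if this arm is under-observed."""
    if arms[arm]["n"] >= MIN_OBS:
        return arms[arm]["c"] / arms[arm]["n"]
    accs = [arms[b]["c"] / arms[b]["n"] for b in neighbours[arm][:3]]
    return float(np.mean(accs))


# Example with three attribute-combination arms; neighbour lists are assumed.
arms = init_arms(["arm0", "arm1", "arm2"], warm_correct=430, warm_total=500)
arms["arm1"] = {"c": 8.0, "n": 10.0}  # a well-observed arm
neighbours = {"arm0": ["arm1", "arm2"], "arm1": [], "arm2": ["arm1", "arm0"]}
print(pooled_accuracy("arm0", arms, neighbours))
```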