Active Assessment of Prediction Services as Accuracy Surface Over Attribute Combinations

Authors: Vihari Piratla, Soumen Chakrabarti, Sunita Sarawagi

NeurIPS 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We report on extensive experiments using several real data sets. Comparison with several estimators based on Bernoulli arm parameters, Beta densities per arm, and even simpler forms of GPs on the arm Beta distributions shows that AAA is superior at quickly cutting down arm accuracy uncertainty. (See the per-arm Beta estimator sketch after this table.)
Researcher Affiliation | Academia | Vihari Piratla, Soumen Chakrabarti, Sunita Sarawagi, Department of Computer Science, Indian Institute of Technology Bombay
Pseudocode | Yes | Detailed pseudo-code of AAA is given in Appendix I.
Open Source Code | Yes | Our code and dataset can be found at: https://github.com/vihari/AAA.
Open Datasets | Yes | We experiment with two real data sets and tasks. Our two tasks are male-female gender classification with two classes and animal classification with 10 classes. Male-Female classification (MF): CelebA [17] is a popular celebrity faces and attribute data set... Animal classification (AC): COCO-Stuff [18] provides an image collection.
Dataset Splits | Yes | Warm start: We start with 500 examples having gold attributes+labels to warm start all our experiments. The random seed also picks this random subset of 500 labeled examples. (See the warm-start split sketch after this table.)
Hardware Specification | No | The paper does not specify the hardware used for its experiments, such as particular GPU or CPU models.
Software Dependencies | No | The paper mentions 'GPyTorch [10]' as a library but does not provide a specific version number. Other mentions, such as the 'ResNet-50 model', refer to architectures rather than specific software versions.
Experiment Setup | Yes | Warm start: We start with 500 examples having gold attributes+labels to warm start all our experiments. The random seed also picks this random subset of 500 labeled examples. We calculate the overall accuracy of the classifier on these warm-start examples as ρ̂ = (1/|D|) Σ_i 1[ŷ_i = y_i]. For all arms we warm start their observation with c_a = λρ̂, n_a = λ, where λ = 0.1, a randomly picked low value. All the numbers reported here are averaged over three runs, each with a different random seed. The initial set of warm-start examples (D) is also changed between the runs. In the case of BetaGP-SLP, for any arm with observation count below 5, we mean pool from its three closest neighbours. (See the warm-start initialization sketch after this table.)
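
The Research Type row mentions baselines that keep a Beta density per arm (attribute combination). The following is a minimal sketch, not taken from the AAA repository, of such a per-arm Beta-Bernoulli accuracy estimator; the class and method names are illustrative only. It shows the quantity the paper calls "arm accuracy uncertainty": the posterior variance of each arm's accuracy.

```python
# Minimal sketch (assumed, not the authors' code) of a per-arm Beta-Bernoulli
# accuracy estimator: each arm keeps its own Beta posterior over accuracy.
from dataclasses import dataclass


@dataclass
class BetaArm:
    alpha: float = 1.0  # pseudo-count of correct predictions
    beta: float = 1.0   # pseudo-count of incorrect predictions

    def update(self, correct: bool) -> None:
        # One labeled example observed for this arm.
        if correct:
            self.alpha += 1.0
        else:
            self.beta += 1.0

    def mean(self) -> float:
        # Posterior mean accuracy of the arm.
        return self.alpha / (self.alpha + self.beta)

    def variance(self) -> float:
        # Posterior variance: the "arm accuracy uncertainty" that an
        # active assessor tries to reduce quickly.
        s = self.alpha + self.beta
        return self.alpha * self.beta / (s * s * (s + 1.0))


# Example: an arm observed on four correct and one incorrect prediction.
arm = BetaArm()
for correct in [True, True, False, True, True]:
    arm.update(correct)
print(arm.mean(), arm.variance())
```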
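
The Dataset Splits row states that the random seed also selects the 500 gold-labeled warm-start examples. Below is a small sketch, under the assumption that a single seeded permutation drives the selection; the function name and NumPy-based implementation are mine, not the authors'.

```python
# Sketch of the seed-driven warm-start split described above (assumed
# implementation): the seed picks the random subset of 500 labeled examples.
import numpy as np


def warm_start_split(num_examples: int, seed: int, warm_size: int = 500):
    """Return indices of warm-start examples and of the remaining pool."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(num_examples)
    warm_idx = perm[:warm_size]   # 500 examples with gold attributes + labels
    pool_idx = perm[warm_size:]   # remaining pool explored during assessment
    return warm_idx, pool_idx


# Three runs, each with a different seed (and hence a different warm-start set D).
for seed in (0, 1, 2):
    warm_idx, pool_idx = warm_start_split(num_examples=10_000, seed=seed)
    print(seed, len(warm_idx), len(pool_idx))
```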
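
The Experiment Setup row describes initializing every arm with pseudo-observations c_a = λρ̂ and n_a = λ (λ = 0.1), and, for BetaGP-SLP, mean pooling from the three closest arms when an arm has fewer than 5 observations. The sketch below illustrates those two steps under assumptions: the data structures, neighbour lists, and function names are hypothetical and not the paper's implementation.

```python
# Hypothetical sketch of the warm-start arm initialization and the
# nearest-neighbour mean pooling for sparsely observed arms.
import numpy as np

LAMBDA = 0.1  # small pseudo-count, per the paper's setup
MIN_OBS = 5   # pooling threshold for under-observed arms


def init_arms(arm_keys, warm_correct, warm_total):
    """Initialize per-arm pseudo-observations (c_a, n_a) from overall accuracy."""
    rho_hat = warm_correct / warm_total  # overall accuracy on warm-start set D
    return {a: {"c": LAMBDA * rho_hat, "n": LAMBDA} for a in arm_keys}


def pooled_accuracy(arm, arms, neighbours):
    """Mean-pool accuracy from the 3 closest arms if this arm is under-observed."""
    if arms[arm]["n"] >= MIN_OBS:
        return arms[arm]["c"] / arms[arm]["n"]
    accs = [arms[b]["c"] / arms[b]["n"] for b in neighbours[arm][:3]]
    return float(np.mean(accs))


# Example with three attribute-combination arms; neighbour lists are assumed.
arms = init_arms(["arm0", "arm1", "arm2"], warm_correct=430, warm_total=500)
arms["arm1"] = {"c": 8.0, "n": 10.0}  # a well-observed arm
neighbours = {"arm0": ["arm1", "arm2"], "arm1": [], "arm2": ["arm1", "arm0"]}
print(pooled_accuracy("arm0", arms, neighbours))
```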