Active Assessment of Prediction Services as Accuracy Surface Over Attribute Combinations
Authors: Vihari Piratla, Soumen Chakrabarti, Sunita Sarawagi
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We report on extensive experiments using several real data sets. Comparison with several estimators based on Bernoulli arm parameters, Beta densities per arm, and even simpler forms of GPs on the arm Beta distributions, shows that AAA is superior at quickly cutting down arm accuracy uncertainty. |
| Researcher Affiliation | Academia | Vihari Piratla, Soumen Chakrabarti, Sunita Sarawagi; Department of Computer Science, Indian Institute of Technology Bombay |
| Pseudocode | Yes | Detailed pseudo-code of AAA is given in Appendix I. |
| Open Source Code | Yes | Our code and dataset can be found at: https://github.com/vihari/AAA. |
| Open Datasets | Yes | We experiment with two real data sets and tasks. Our two tasks are male-female gender classification with two classes and animal classification with 10 classes. Male-Female classification (MF): CelebA [17] is a popular celebrity faces and attributes data set... Animal classification (AC): COCO-Stuff [18] provides an image collection. |
| Dataset Splits | Yes | Warm start: We start with 500 examples having gold attributes+labels to warm start all our experiments. The random seed also picks this random subset of 500 labeled examples. |
| Hardware Specification | No | The paper does not specify the hardware used for its experiments, such as particular GPU or CPU models. |
| Software Dependencies | No | The paper mentions 'GPyTorch [10]' as a library but does not provide a specific version number. Other mentions like 'ResNet-50 model' refer to architectures, not specific software versions. |
| Experiment Setup | Yes | Warm start: We start with 500 examples having gold attributes+labels to warm start all our experiments. The random seed also picks this random subset of 500 labeled examples. We calculate the overall accuracy of the classifier on these warm-start examples as ρ̂ = (1/|D|) Σᵢ 1[ŷᵢ = yᵢ]. For all arms we warm start their observation with c_a = λρ̂, n_a = λ where λ = 0.1, a randomly picked low value. All the numbers reported here are averaged over three runs each with a different random seed. The initial set of warm-start examples (D) is also changed between the runs. In the case of Beta GP-SLP, for any arm with observation count below 5, we mean pool from its three closest neighbours. |
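The warm-start initialization quoted above can be sketched in a few lines. This is an illustrative reconstruction, not the authors' released code: the function name `warm_start_arms` and its signature are assumptions, and the per-arm pseudo-counts (c_a, n_a) follow the description c_a = λρ̂, n_a = λ with λ = 0.1.

```python
import numpy as np

def warm_start_arms(correct, n_arms, lam=0.1):
    """Warm-start per-arm Beta observations (illustrative sketch).

    correct : boolean array over the warm-start examples D,
              True where the service's prediction matched the gold label.
    n_arms  : number of attribute-combination arms.
    lam     : the low pseudo-count lambda from the paper (0.1).

    Returns per-arm pseudo-counts (c_a, n_a): c_a = lam * rho_hat,
    n_a = lam, where rho_hat is the overall warm-start accuracy.
    """
    rho_hat = np.mean(correct)            # overall accuracy on warm-start set
    c = np.full(n_arms, lam * rho_hat)    # correct-count pseudo-observations
    n = np.full(n_arms, lam)              # total-count pseudo-observations
    return c, n

# Example: 3 of 4 warm-start predictions correct -> rho_hat = 0.75,
# so every arm starts with c_a = 0.075 and n_a = 0.1.
c, n = warm_start_arms(np.array([True, True, False, True]), n_arms=3)
```

Every arm thus starts from the same weak prior centred on the classifier's overall accuracy, which the per-arm observations then quickly override as labelled examples arrive.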