Active Testing: An Efficient and Robust Framework for Estimating Accuracy
Authors: Phuc Nguyen, Deva Ramanan, Charless Fowlkes
ICML 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We validate our active testing framework on two applications: multi-label classification and instance segmentation. For each application, we describe the datasets and systems evaluated and the specifics of the estimators and vetting strategies used. We measure the estimation accuracy of different combinations of vetting strategies and estimators at different amounts of vetting effort. We compute the absolute error between the estimated metric and the true (fully vetted) metric and average over all classes; averaging the absolute estimation error across classes prevents over-estimation on one class from canceling out under-estimation on another. We plot the mean and the standard deviation over 50 simulation runs of each active testing approach (see the error-computation sketch after the table). |
| Researcher Affiliation | Academia | ¹University of California, Irvine; ²Carnegie Mellon University. Correspondence to: Phuc Nguyen <nguyenpx@uci.edu>. |
| Pseudocode | Yes | Algorithm 1 (Active Testing Algorithm). Input: unvetted set U, vetted set V, total budget T, vetting strategy VS, system scores S = {s_i}... (a runnable sketch of this loop appears after the table) |
| Open Source Code | No | The paper does not provide any statement or link regarding the public availability of its source code. |
| Open Datasets | Yes | NUS-WIDE: This dataset contains 269,648 Flickr images with 5018 unique tags. The authors also provide a semi-complete ground-truth via manual annotations for 81 concepts (Chua et al., 2009); Micro-videos: ... (Nguyen et al., 2016) formulated a multi-label video-retrieval/annotation task for a large collection of Vine videos. They introduce a micro-video dataset, MV-85k, containing 260K videos with 58K tags; COCO Minival: For instance segmentation, we use the minival2014 subset of the COCO dataset (Lin et al., 2014). |
| Dataset Splits | No | The paper discusses training and testing but does not explicitly mention or describe a dedicated validation dataset split for hyperparameter tuning or model selection. |
| Hardware Specification | No | The paper does not provide specific details regarding the hardware (e.g., CPU, GPU models, memory) used for running its experiments. |
| Software Dependencies | No | The paper mentions software components and models like 'ResNet-50', 'Mask R-CNN', and 'χ²-SVM', but it does not provide specific version numbers for these or any other software dependencies needed for reproducibility. |
| Experiment Setup | Yes | For tagging, we estimate the posterior over unvetted tags, p(z_i|O), based on two pieces of observed information: the statistics of noisy labels y_i on vetted examples, and the system confidence score, s_i. This posterior probability can be derived as (see supplement for proof): $p(z_i \mid s_i, y_i) = \frac{p(y_i \mid z_i)\, p(z_i \mid s_i)}{\sum_{v \in \{0,1\}} p(y_i \mid z_i = v)\, p(z_i = v \mid s_i)}$. Given some vetted data, we fit the tag-flipping priors p(y_i|z_i) by standard maximum likelihood estimation (counting frequencies). The posterior probability of the true label given the classifier confidence score, p(z_i|s_i), is fit using logistic regression (a posterior-fitting sketch follows the table). To compute the probability that a detection will pass the IoU threshold with an unvetted, bounding-box-only ground-truth instance (p(z_i|O) in Eq. 9), we train a χ²-SVM using the vetted portion of the database. The features for an example include the category id, the noisy IoU estimate, the size of the bounding box containing the detection mask, and the size of the ground-truth bounding box. The training label is true if the true IoU estimate, computed using the vetted ground-truth mask and the detection mask, is above a given input IoU threshold. |
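As a concrete illustration of the evaluation protocol quoted in the Research Type row, here is a minimal Python sketch of the reported error computation: per-class absolute error averaged across classes, then mean and standard deviation over repeated simulation runs. The function names and the `run_active_test` callable are hypothetical and not taken from the paper or any released code.

```python
import numpy as np

def estimation_error(estimated_ap, true_ap):
    """Mean absolute error between estimated and true per-class metrics.

    Taking the absolute value per class before averaging prevents
    over-estimation on one class from canceling under-estimation
    on another (hypothetical helper, not the authors' code).
    """
    return np.mean(np.abs(np.asarray(estimated_ap) - np.asarray(true_ap)))

def simulate(run_active_test, true_ap, n_runs=50):
    """Mean and standard deviation of the error over n_runs simulations.

    `run_active_test(seed)` is an assumed callable returning the
    per-class metric estimates for one simulated active-testing run.
    """
    errors = [estimation_error(run_active_test(seed), true_ap)
              for seed in range(n_runs)]
    return np.mean(errors), np.std(errors)
```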
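The Pseudocode row quotes only the header of Algorithm 1. One possible runnable reading of that loop is sketched below; `select_batch`, `fit_estimator`, `estimate_metric`, and `oracle` are assumed interfaces standing in for the paper's vetting strategy, posterior estimator, metric estimator, and human annotator, not the authors' implementation.

```python
def active_testing(unvetted, vetted, budget, select_batch, fit_estimator,
                   estimate_metric, oracle, batch_size=10):
    """Sketch of the Algorithm 1 loop under assumed interfaces.

    Examples are dicts; the vetted label is stored under key "z"
    (an illustrative convention, not from the paper).
    """
    posterior = fit_estimator(vetted)          # initial p(z_i | O)
    spent = 0
    while spent < budget and unvetted:
        # Vetting strategy picks which examples to send to a human next.
        batch = select_batch(unvetted, vetted, posterior)[:batch_size]
        for example in batch:
            example["z"] = oracle(example)     # human provides the true label
            unvetted.remove(example)
            vetted.append(example)
        spent += len(batch)
        posterior = fit_estimator(vetted)      # re-fit on the enlarged vetted set
    # Final estimate combines vetted labels with posteriors for the rest.
    return estimate_metric(vetted, unvetted, posterior)
```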
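The Experiment Setup row describes how the tagging posterior is assembled: tag-flipping priors p(y_i|z_i) counted from vetted data, p(z_i|s_i) fit by logistic regression on the confidence score, and the two combined by Bayes' rule as in the displayed equation. A minimal sketch of that combination, assuming scikit-learn and illustrative variable names (`s_vetted`, `y_vetted`, `z_vetted` are not from the authors' code):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_posterior(s_vetted, y_vetted, z_vetted):
    """Fit both ingredients of the posterior from vetted tags.

    s_vetted: classifier confidence scores, y_vetted: noisy labels,
    z_vetted: vetted (true) labels, all binary-valued arrays.
    """
    # Tag-flipping priors p(y|z) by maximum likelihood (frequency counts),
    # Laplace-smoothed to avoid zero counts (smoothing is our addition).
    flip = np.ones((2, 2))
    for y, z in zip(y_vetted, z_vetted):
        flip[int(z), int(y)] += 1
    flip /= flip.sum(axis=1, keepdims=True)

    # p(z|s) via logistic regression on the confidence score.
    lr = LogisticRegression().fit(np.asarray(s_vetted).reshape(-1, 1), z_vetted)

    def posterior(s, y):
        """p(z=1 | s, y) for an unvetted example, via Bayes' rule."""
        pz = lr.predict_proba([[s]])[0]        # [p(z=0|s), p(z=1|s)]
        num = flip[1, int(y)] * pz[1]
        den = flip[0, int(y)] * pz[0] + num
        return num / den

    return posterior
```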