Active Testing: An Efficient and Robust Framework for Estimating Accuracy
Authors: Phuc Nguyen, Deva Ramanan, Charless Fowlkes
ICML 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We validate our active testing framework on two applications: multi-label classification and instance segmentation. For each application, we describe the datasets and systems evaluated and the specifics of the estimators and vetting strategies used. We measure the estimation accuracy of different combinations of vetting strategies and estimators at different amounts of vetting effort. We compute the absolute error between the estimated metric and the true (fully vetted) metric and average over all classes; averaging the absolute estimation error across classes prevents over-estimation on one class from canceling out under-estimation on another. We plot the mean and the standard deviation over 50 simulation runs of each active testing approach (see the error-computation sketch after the table). |
| Researcher Affiliation | Academia | ¹University of California, Irvine; ²Carnegie Mellon University. Correspondence to: Phuc Nguyen <nguyenpx@uci.edu>. |
| Pseudocode | Yes | Algorithm 1 (Active Testing Algorithm). Input: unvetted set U, vetted set V, total budget T, vetting strategy VS, system scores S = {s_i}... (a runnable sketch of this loop appears after the table) |
| Open Source Code | No | The paper does not provide any statement or link regarding the public availability of its source code. |
| Open Datasets | Yes | NUS-WIDE: This dataset contains 269,648 Flickr images with 5018 unique tags. The authors also provide a semi-complete ground-truth via manual annotations for 81 concepts (Chua et al., 2009); Micro-videos: ... (Nguyen et al., 2016) formulated a multi-label video-retrieval/annotation task for a large collection of Vine videos. They introduce a micro-video dataset, MV-85k, containing 260K videos with 58K tags; COCO Minival: For instance segmentation, we use the minival2014 subset of the COCO dataset (Lin et al., 2014). |
| Dataset Splits | No | The paper discusses training and testing but does not explicitly mention or describe a dedicated validation dataset split for hyperparameter tuning or model selection. |
| Hardware Specification | No | The paper does not provide specific details regarding the hardware (e.g., CPU, GPU models, memory) used for running its experiments. |
| Software Dependencies | No | The paper mentions software components and models like 'ResNet-50', 'Mask R-CNN', and 'χ²-SVM', but it does not provide specific version numbers for these or any other software dependencies needed for reproducibility. |
| Experiment Setup | Yes | For tagging, we estimate the posterior over unvetted tags, p(z_i|O), based on two pieces of observed information: the statistics of noisy labels y_i on vetted examples, and the system confidence score, s_i. This posterior probability can be derived as (see supplement for proof): $p(z_i \mid s_i, y_i) = \frac{p(y_i \mid z_i)\, p(z_i \mid s_i)}{\sum_{v \in \{0,1\}} p(y_i \mid z_i = v)\, p(z_i = v \mid s_i)}$. Given some vetted data, we fit the tag-flipping priors p(y_i|z_i) by standard maximum likelihood estimation (counting frequencies). The posterior probability of the true label given the classifier confidence score, p(z_i|s_i), is fit using logistic regression (a posterior-fitting sketch follows the table). To compute the probability that a detection will pass the IoU threshold with an unvetted, bounding-box-only ground-truth instance (p(z_i|O) in Eq. 9), we train a χ²-SVM using the vetted portion of the database. The features for an example include the category id, the noisy IoU estimate, the size of the bounding box containing the detection mask, and the size of the ground-truth bounding box. The training label is true if the true IoU estimate, computed using the vetted ground-truth mask and the detection mask, is above a given input IoU threshold. |
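As a concrete illustration of the evaluation protocol quoted in the Research Type row, here is a minimal Python sketch of the reported error computation: per-class absolute error averaged across classes, then mean and standard deviation over repeated simulation runs. The function names and the `run_active_test` callable are hypothetical and not taken from the paper or any released code.

```python
import numpy as np

def estimation_error(estimated_ap, true_ap):
    """Mean absolute error between estimated and true per-class metrics.

    Taking the absolute value per class before averaging prevents
    over-estimation on one class from canceling under-estimation
    on another (hypothetical helper, not the authors' code).
    """
    return np.mean(np.abs(np.asarray(estimated_ap) - np.asarray(true_ap)))

def simulate(run_active_test, true_ap, n_runs=50):
    """Mean and standard deviation of the error over n_runs simulations.

    `run_active_test(seed)` is an assumed callable returning the
    per-class metric estimates for one simulated active-testing run.
    """
    errors = [estimation_error(run_active_test(seed), true_ap)
              for seed in range(n_runs)]
    return np.mean(errors), np.std(errors)
```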
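The Pseudocode row quotes only the header of Algorithm 1. One possible runnable reading of that loop is sketched below; `select_batch`, `fit_estimator`, `estimate_metric`, and `oracle` are assumed interfaces standing in for the paper's vetting strategy, posterior estimator, metric estimator, and human annotator, not the authors' implementation.

```python
def active_testing(unvetted, vetted, budget, select_batch, fit_estimator,
                   estimate_metric, oracle, batch_size=10):
    """Sketch of the Algorithm 1 loop under assumed interfaces.

    Examples are dicts; the vetted label is stored under key "z"
    (an illustrative convention, not from the paper).
    """
    posterior = fit_estimator(vetted)          # initial p(z_i | O)
    spent = 0
    while spent < budget and unvetted:
        # Vetting strategy picks which examples to send to a human next.
        batch = select_batch(unvetted, vetted, posterior)[:batch_size]
        for example in batch:
            example["z"] = oracle(example)     # human provides the true label
            unvetted.remove(example)
            vetted.append(example)
        spent += len(batch)
        posterior = fit_estimator(vetted)      # re-fit on the enlarged vetted set
    # Final estimate combines vetted labels with posteriors for the rest.
    return estimate_metric(vetted, unvetted, posterior)
```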
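The Experiment Setup row describes how the tagging posterior is assembled: tag-flipping priors p(y_i|z_i) counted from vetted data, p(z_i|s_i) fit by logistic regression on the confidence score, and the two combined by Bayes' rule as in the displayed equation. A minimal sketch of that combination, assuming scikit-learn and illustrative variable names (`s_vetted`, `y_vetted`, `z_vetted` are not from the authors' code):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_posterior(s_vetted, y_vetted, z_vetted):
    """Fit both ingredients of the posterior from vetted tags.

    s_vetted: classifier confidence scores, y_vetted: noisy labels,
    z_vetted: vetted (true) labels, all binary-valued arrays.
    """
    # Tag-flipping priors p(y|z) by maximum likelihood (frequency counts),
    # Laplace-smoothed to avoid zero counts (smoothing is our addition).
    flip = np.ones((2, 2))
    for y, z in zip(y_vetted, z_vetted):
        flip[int(z), int(y)] += 1
    flip /= flip.sum(axis=1, keepdims=True)

    # p(z|s) via logistic regression on the confidence score.
    lr = LogisticRegression().fit(np.asarray(s_vetted).reshape(-1, 1), z_vetted)

    def posterior(s, y):
        """p(z=1 | s, y) for an unvetted example, via Bayes' rule."""
        pz = lr.predict_proba([[s]])[0]        # [p(z=0|s), p(z=1|s)]
        num = flip[1, int(y)] * pz[1]
        den = flip[0, int(y)] * pz[0] + num
        return num / den

    return posterior
```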