Similarity Search for Efficient Active Learning and Search of Rare Concepts
Authors: Cody Coleman, Edward Chou, Julian Katz-Samuels, Sean Culatana, Peter Bailis, Alexander C. Berg, Robert Nowak, Roshan Sumbaly, Matei Zaharia, I. Zeki Yalniz
AAAI 2022, pp. 6402-6410 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate several selection strategies in this setting on three large-scale computer vision datasets: ImageNet, OpenImages, and a de-identified and aggregated dataset of 10 billion publicly shared images provided by a large internet company. Our approach achieved similar mAP and recall as the traditional global approach while reducing the computational cost of selection by up to three orders of magnitude, enabling web-scale active learning. |
| Researcher Affiliation | Collaboration | Stanford University, Facebook AI, University of Wisconsin, Facebook AI Research |
| Pseudocode | Yes | Algorithm 1: BASELINE APPROACH |
| Open Source Code | No | No explicit statement about providing open-source code for the methodology or a link to a code repository. |
| Open Datasets | Yes | We applied SEALS to three selection strategies (MaxEnt, MLP, and ID) and performed active learning and search on three separate datasets: ImageNet [Russakovsky et al. 2015], OpenImages [Kuznetsova et al. 2020], and a de-identified and aggregated dataset of 10 billion publicly shared images (Table 1). |
| Dataset Splits | No | No complete, reproducible training/validation/test splits are provided for all datasets; only specific subsets are described (e.g., 'The 50,000 validation images were used as the test set', 'predefined test split'). |
| Hardware Specification | No | No specific hardware details (e.g., exact GPU/CPU models, memory amounts) are provided. Mentions 'a single 24-core machine' or 'a cluster with tens of thousands of cores' which are not specific enough. |
| Software Dependencies | No | No specific software dependencies with version numbers are provided. Mentions Faiss and ResNet-50 but without version details. |
| Experiment Setup | Yes | Each experiment started with 5 positive examples... Negative examples were randomly selected at a ratio of 19 negative examples to every positive example to form a seed set L_0^r with 5 positives and 95 negatives. The batch size b for each selection round was the same as the size of the initial seed set (i.e., 100 examples), and the max labeling budget T was 2,000 examples. As the binary classifier for each concept A_r, we used logistic regression trained on the embedded examples. |
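
The "Pseudocode" and "Experiment Setup" rows describe a standard pool-based active-learning loop: a 100-example seed set (5 positives, 95 negatives at a 19:1 ratio), logistic regression trained on fixed embeddings, and batches of 100 selected per round up to a 2,000-example budget. A minimal sketch of that loop, assuming max-entropy (uncertainty) selection and scikit-learn; the function and variable names are illustrative, not taken from the authors' code:

```python
# A minimal sketch of the per-concept active-learning loop described above,
# assuming precomputed image embeddings and max-entropy selection.
# All names here are illustrative; this is not the authors' released code.
import numpy as np
from sklearn.linear_model import LogisticRegression


def max_entropy_batch(clf, embeddings, candidate_ids, batch_size=100):
    """Select the candidates whose predicted probability is closest to 0.5
    (highest binary entropy)."""
    p = clf.predict_proba(embeddings[candidate_ids])[:, 1]
    entropy = -(p * np.log(p + 1e-12) + (1.0 - p) * np.log(1.0 - p + 1e-12))
    top = np.argsort(-entropy)[:batch_size]
    return [candidate_ids[i] for i in top]


def run_active_learning(embeddings, oracle_labels, seed_ids,
                        budget=2000, batch_size=100):
    """Start from a 100-example seed set (5 positives, 95 negatives), then
    label rounds of 100 until the 2,000-example budget is exhausted."""
    labeled = list(seed_ids)
    while len(labeled) < budget:
        clf = LogisticRegression(max_iter=1000)
        clf.fit(embeddings[labeled], oracle_labels[labeled])
        labeled_set = set(labeled)
        candidates = [i for i in range(len(embeddings)) if i not in labeled_set]
        labeled.extend(max_entropy_batch(clf, embeddings, candidates, batch_size))
    return labeled
```

In the global baseline (Algorithm 1 quoted in the "Pseudocode" row), `candidates` is the entire unlabeled pool; SEALS replaces it with a restricted pool built by similarity search.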
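The "Software Dependencies" row notes that the paper mentions Faiss and ResNet-50 embeddings without version details; the cost reduction reported in the "Research Type" row comes from restricting each round's candidate pool to the nearest neighbors of the labeled examples rather than scoring the full unlabeled pool. A hedged sketch of that restriction with a flat Faiss index (the index type, k value, and helper names are assumptions for illustration, not details from the paper):

```python
# A hedged sketch of restricting the candidate pool to the k nearest
# neighbors of the labeled examples using Faiss. The flat inner-product
# index, k value, and helper names are assumptions, not the paper's setup.
import numpy as np
import faiss


def build_index(embeddings):
    """Index L2-normalized float32 embeddings for cosine-similarity search."""
    x = np.ascontiguousarray(embeddings, dtype=np.float32)
    faiss.normalize_L2(x)
    index = faiss.IndexFlatIP(x.shape[1])
    index.add(x)
    return index


def restricted_candidate_pool(index, embeddings, labeled_ids, k=100):
    """Candidate pool = union of each labeled example's k nearest neighbors,
    instead of the entire unlabeled pool."""
    queries = np.ascontiguousarray(embeddings[labeled_ids], dtype=np.float32)
    faiss.normalize_L2(queries)
    _, neighbor_ids = index.search(queries, k)
    return sorted(set(neighbor_ids.ravel().tolist()) - set(labeled_ids))
```

Plugging this restricted pool in place of `candidates` in the loop above keeps per-round selection cost proportional to the labeled set rather than to the unlabeled pool, which is how the paper reports reducing selection cost by up to three orders of magnitude at similar mAP and recall.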