Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Revisiting Active Learning in the Era of Vision Foundation Models
Authors: Sanket Rajan Gupte, Josiah Aklilu, Jeffrey J Nirschl, Serena Yeung-Levy
TMLR 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this work, we evaluate how foundation models influence three critical components of effective AL, namely, 1) initial labeled pool selection, 2) ensuring diverse sampling, and 3) the trade-off between representative and uncertainty sampling. We systematically study how the robust representations of foundation models (DINOv2, Open CLIP) challenge existing findings in active learning. Our observations inform the principled construction of a new simple and elegant AL strategy that balances uncertainty estimated via dropout with sample diversity. We extensively test our strategy on many challenging image classification benchmarks, including natural images as well as out-of-domain biomedical images that are relatively understudied in the AL literature. |
| Researcher Affiliation | Academia | Sanket Rajan Gupte EMAIL Department of Computer Science Stanford University; Josiah Aklilu EMAIL Department of Biomedical Data Science Stanford University; Jeffrey J. Nirschl EMAIL Department of Pathology Stanford University; Serena Yeung-Levy EMAIL Department of Biomedical Data Science Stanford University |
| Pseudocode | Yes | A detailed description of the query strategy is provided in Algorithm 1. Below, we briefly review the key results from Section 3 that motivate the choice of components and the construction of DropQuery. Algorithm 1 DropQuery. Input: unlabeled instances z_i ∈ Z_U, external oracle ϕ(·), budget B. Output: queried labels Y = {y_i : i ∈ 1, . . . , B} |
| Open Source Code | Yes | We also provide a highly performant and efficient implementation of modern AL strategies (including our method) at https://github.com/sanketx/AL-foundation-models. |
| Open Datasets | Yes | Table 1: Effect of initial pool selection on performance. Test set accuracy using our centroid-based initialization. ... We show AL iterations t for datasets CIFAR100 (Krizhevsky, 2009), Food101 (Bossard et al., 2014), ImageNet-100 (Gansbeke et al., 2020), and DomainNet-Real (Peng et al., 2019) (from top to bottom) with DINOv2 ViT-g14 as the feature extractor f. Our proposed AL strategy is evaluated through a comprehensive set of experiments on diverse natural image datasets sourced from the VTAB+ benchmark (Schuhmann et al., 2022). Among the fine-grained natural image classification datasets, in Stanford Cars (Krause et al., 2013) and Oxford-IIIT Pets (Parkhi et al., 2012), our DropQuery outperforms all other AL queries in each iteration while also outperforming complex query approaches like Alfa-Mix and Typiclust in the low-budget regime in FGVC Aircraft (Maji et al., 2013) (see Figure 2). Our approach, which is agnostic to dataset and model, outperforms the state-of-the-art AL queries, which often necessitate additional hyperparameter tuning given the underlying data distribution. We also test our method on a large-scale dataset with 365 classes, Places365 (Zhou et al., 2017) (which contains approximately 1.8 million images), and our strategy beats all other modern AL queries. Figure 2: ... (Bottom row) AL curves for biomedical datasets, including images of peripheral blood smears (Acevedo et al., 2020), retinal fundoscopy (Kaggle & EyePACS, 2015), HeLa cell structures (Murphy et al., 2000), and skin dermoscopy (Tschandl et al., 2018), covering pathology, ophthalmology, cell biology, and dermatology domains using various imaging modalities. |
| Dataset Splits | No | The paper mentions a held-out test set (Dtest) and a training pool (Dpool) for the active learning process, but does not provide specific details on how these initial splits are made (e.g., percentages, sample counts, or references to standard split methodologies for each dataset used). It defines the active learning loop's query budget and iterations but not the initial partitioning of the entire dataset into these components. |
| Hardware Specification | No | The paper does not provide specific details regarding the hardware used for running the experiments (e.g., specific GPU models, CPU types, or memory specifications). While it discusses large vision models and their performance, the computational resources used for the experiments are not specified. |
| Software Dependencies | No | The paper does not provide specific version numbers for any software dependencies or libraries used (e.g., Python, PyTorch, TensorFlow, or CUDA versions). It focuses on the methodology and experimental results without detailing the software environment. |
| Experiment Setup | Yes | Following the recommendations of (Munjal et al., 2022), we apply strong regularization to our models in the form of weight decay (1e-2) and aggressive dropout (0.75). ... In our experiments, we set a query budget B of 1 sample per class per iteration. For instance, the CIFAR100 dataset contains images corresponding to 100 classes, so the query budget is set to 100 samples. ... We run our active learning loop for 20 iterations and average our results over 5 random seeds for each experiment. Uncertainty-based sampling approach: Given features z_i of an unlabeled instance, we produce M dropout samples of these inputs (ρ = 0.75) to get {z_i^1, . . . , z_i^M}. ... In all of our experiments, we set M = 3. |
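The quoted excerpts describe the core of the strategy: estimate uncertainty by classifying M = 3 dropout-perturbed copies (ρ = 0.75) of each unlabeled instance's foundation-model features, and then select a diverse batch (the paper also reports centroid-based selection). A minimal illustrative sketch of such a dropout-disagreement + clustering query is shown below. This is a reconstruction for intuition only, not the paper's Algorithm 1: the function name `drop_query`, the "any disagreement" uncertainty rule, and the k-means nearest-to-centroid diversity step are assumptions; consult the authors' repository for the actual implementation.

```python
import numpy as np
from sklearn.cluster import KMeans


def drop_query(features, classifier, budget, n_drop=3, p_drop=0.75, seed=0):
    """Sketch of a dropout-disagreement + diversity query step.

    features:   (N, D) array of frozen foundation-model features z_i.
    classifier: any fitted model exposing predict() (e.g. a linear probe).
    budget:     number of samples B to query this iteration.
    """
    rng = np.random.default_rng(seed)
    base_pred = classifier.predict(features)

    # Count how often predictions on dropout-perturbed features disagree
    # with the prediction on the clean features (M = n_drop samples).
    disagreements = np.zeros(len(features), dtype=int)
    for _ in range(n_drop):
        mask = rng.random(features.shape) >= p_drop
        dropped = features * mask / (1.0 - p_drop)  # inverted-dropout scaling
        disagreements += (classifier.predict(dropped) != base_pred).astype(int)

    # Uncertain candidates: any disagreement with the clean prediction
    # (assumed rule; fall back to the full pool if too few candidates).
    uncertain = np.flatnonzero(disagreements > 0)
    pool = uncertain if len(uncertain) >= budget else np.arange(len(features))

    # Diversity: cluster the candidates and query the sample nearest
    # to each centroid, one per cluster.
    km = KMeans(n_clusters=budget, n_init=10, random_state=seed).fit(features[pool])
    queried = []
    for c in range(budget):
        members = pool[km.labels_ == c]
        dists = np.linalg.norm(features[members] - km.cluster_centers_[c], axis=1)
        queried.append(int(members[np.argmin(dists)]))
    return queried
```

Per the quoted setup, `budget` would be one sample per class per iteration (e.g. 100 for CIFAR100), with `n_drop=3` and `p_drop=0.75`.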