Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Stochastic Batch Acquisition: A Simple Baseline for Deep Active Learning

Authors: Andreas Kirsch, Sebastian Farquhar, Parmida Atighehchian, Andrew Jesson, Frédéric Branchaud-Charron, Yarin Gal

TMLR 2023 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this section, we empirically verify that the presented stochastic acquisition methods (a) outperform top-K acquisition and (b) are competitive with specially designed batch acquisition schemes like BADGE (Ash et al., 2020) and BatchBALD (Kirsch et al., 2019), while being vastly cheaper than these more complicated methods. To probe the possible weaknesses of recent batch acquisition methods, we use a range of datasets. These experiments show that the performance of the stochastic extensions is not dependent on the specific characteristics of any particular dataset. Our experiments include computer vision, natural language processing (NLP), and causal inference (in 6.1).
Researcher Affiliation | Collaboration | Andreas Kirsch (OATML, Department of Computer Science, University of Oxford); Sebastian Farquhar (OATML, Department of Computer Science, University of Oxford); Parmida Atighehchian (ServiceNow); Andrew Jesson (OATML, Department of Computer Science, University of Oxford); Frédéric Branchaud-Charron (ServiceNow); Yarin Gal (OATML, Department of Computer Science, University of Oxford)
Pseudocode | Yes | Listing 1: Code for stochastic batch acquisition. Colab here.
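Listing 1 itself is not reproduced in this card. As an illustration of the idea it implements, the sketch below shows a Gumbel-top-k style stochastic batch acquisition in NumPy: instead of deterministically taking the K highest-scoring pool points, each score is perturbed with Gumbel noise before the top-K is taken, which samples a batch without replacement with probabilities proportional to exponentiated scores. Function and parameter names here are illustrative, not taken from the paper's code.

```python
import numpy as np

def stochastic_batch_acquisition(scores, batch_size, coldness=1.0, rng=None):
    """Sample `batch_size` pool indices without replacement.

    Adding i.i.d. Gumbel noise to `coldness * score` and taking the top-k
    draws indices with probability proportional to exp(coldness * score)
    (the Gumbel-top-k trick), instead of deterministic top-K acquisition.
    """
    rng = np.random.default_rng() if rng is None else rng
    scores = np.asarray(scores, dtype=float)
    perturbed = coldness * scores + rng.gumbel(size=scores.shape)
    # Indices of the k largest perturbed scores.
    return np.argsort(-perturbed)[:batch_size]

# Hypothetical per-point acquisition scores (e.g. BALD) for a small pool.
scores = np.array([0.9, 0.1, 0.5, 0.7, 0.2])
batch = stochastic_batch_acquisition(scores, batch_size=2,
                                     rng=np.random.default_rng(0))
```

With `coldness → ∞` this recovers deterministic top-K; with `coldness = 0` it is uniform sampling.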
Open Source Code | Yes | Frameworks. We use PyTorch. Repeated-MNIST and EMNIST experiments use PyTorch Ignite. Synbols and MIO-TCD experiments use the BaaL library: https://github.com/baal-org/baal (Atighehchian et al., 2020). Predictive parity is calculated using FairLearn (Bird et al., 2020). The Causal-BALD experiments use https://github.com/anndvision/causal-bald (Jesson et al., 2021). The experiments comparing to ACS-FW (Pinsler et al., 2019) use the original authors' implementation with added support for stochastic batch acquisitions: https://github.com/BlackHC/active-bayesian-coresets/releases/tag/stoch_batch_acq_paper. The Repeated-MNIST experiments were run using https://github.com/BlackHC/active_learning_redux/releases/tag/stoch_batch_acq, and the results are also available on WandB (https://wandb.ai/oatml-andreas-kirsch/oatml-snow-stoch-acq).
Open Datasets | Yes | Our experiments include computer vision, natural language processing (NLP), and causal inference (in 6.1). We show that stochastic acquisition helps avoid selecting redundant samples on Repeated-MNIST (Kirsch et al., 2019), and examine performance in active learning for computer vision on EMNIST (Cohen et al., 2017), MIO-TCD (Luo et al., 2018), Synbols (Lacoste et al., 2020), and CLINC-150 (Larson et al., 2019) for intent classification in NLP. MIO-TCD is especially close to real-world datasets in size and quality. In Appendix C.5, we further investigate edge cases using the Synbols dataset under different types of biases and noise, and in Appendix C.7, we also separately examine stochastic batch acquisition using last-layer MFVI models on CIFAR-10 (Krizhevsky et al., 2009), SVHN (Netzer et al., 2011), Repeated-MNIST, and Fashion-MNIST (Xiao et al., 2017) and compare to ACS-FW (Pinsler et al., 2019). We follow the experiments of Jesson et al. (2021) on both synthetic data and the semi-synthetic IHDP dataset (Hill, 2011), a commonly used benchmark for causal effect estimation.
Dataset Splits | Yes | Using a Monte-Carlo Dropout BNN trained on MNIST with an initial 20 points and 73% initial accuracy; score ranks computed over the test set. Repeated-MNIST: ... We use an acquisition size of 10 and 4 dataset repetitions. Computer Vision, EMNIST: ... We use an acquisition size of 5 for BatchBALD, and 10 otherwise. Computer Vision, MIO-TCD: ... We use an acquisition size of 100 for all methods. Natural Language Processing, CLINC-150: ... We use an acquisition size of 100, starting from an initial training set of 1510 points (10 points per intent class). MFVI Last-Layer Comparison with ACS-FW: ... We use 5000 initial training samples with an acquisition size of 4000 for CIFAR-10, 1000 initial training samples with an acquisition size of 2000 for SVHN, 20 initial training samples with an acquisition size of 100 for Repeated-MNIST, and 20 initial training samples with an acquisition size of 25 for Fashion-MNIST.
Hardware Specification | Yes | Our experiments used about 25,000 compute hours on Titan RTX GPUs. Compute: results shown in Table 1 were run inside Docker containers with 8 CPUs (2.2 GHz) and 32 GB of RAM. Other experiments were run on similar machines with Titan RTX GPUs.
Software Dependencies | No | Frameworks. We use PyTorch. Repeated-MNIST and EMNIST experiments use PyTorch Ignite. Synbols and MIO-TCD experiments use the BaaL library: https://github.com/baal-org/baal (Atighehchian et al., 2020). Predictive parity is calculated using FairLearn (Bird et al., 2020). The Causal-BALD experiments use https://github.com/anndvision/causal-bald (Jesson et al., 2021). The experiments comparing to ACS-FW (Pinsler et al., 2019) use the original authors' implementation with added support for stochastic batch acquisitions: https://github.com/BlackHC/active-bayesian-coresets/releases/tag/stoch_batch_acq_paper. The Repeated-MNIST experiments were run using https://github.com/BlackHC/active_learning_redux/releases/tag/stoch_batch_acq, and the results are also available on WandB (https://wandb.ai/oatml-andreas-kirsch/oatml-snow-stoch-acq). We fine-tune a pretrained DistilBERT model from Hugging Face (Wolf et al., 2020) on CLINC-150 for 5 epochs with Adam as the optimiser.
Experiment Setup | Yes | Experimental Setup & Compute. We document the experimental setup and model architectures in detail in Appendix C.1. C.1.2 Repeated-MNIST: ...a LeNet-5-like architecture with ReLU activations instead of tanh and added dropout... two blocks of convolution, dropout, max-pooling, and ReLU with 32 and 64 channels and 5x5 kernel size, respectively. As the classifier head, a two-layer MLP with 128 hidden units (and 10 output units) is used that includes dropout between the layers. We use a dropout probability of 0.5 everywhere. The model is trained with early stopping using the Adam optimiser and a learning rate of 0.001. We sample predictions using 100 MC-Dropout samples for BALD. Weights are reinitialized after each acquisition step. Table 3: Hyper-parameters used in Sections 5 and C.5: learning rate 0.001; optimiser SGD; weight decay 0; momentum 0.9; loss function cross-entropy; training duration 10; batch size 32; dropout p 0.5; MC iterations 20; query size 100; initial set 500.
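The setup above computes BALD scores from MC-Dropout samples (100 samples per pool point for Repeated-MNIST). As a reference for how such scores are typically derived from sampled softmax outputs, here is a minimal NumPy sketch of the standard BALD estimator (entropy of the mean prediction minus the mean entropy of the sampled predictions); it is a generic implementation, not the paper's own code.

```python
import numpy as np

def bald_scores(probs):
    """Estimate per-point BALD scores from MC-Dropout samples.

    probs: array of shape (num_mc_samples, num_points, num_classes)
    holding softmax outputs from stochastic forward passes.
    Returns H[mean prediction] - mean over samples of H[prediction],
    an estimate of the mutual information between label and parameters.
    """
    eps = 1e-12  # guard against log(0)
    mean_p = probs.mean(axis=0)
    entropy_of_mean = -(mean_p * np.log(mean_p + eps)).sum(axis=-1)
    mean_entropy = -(probs * np.log(probs + eps)).sum(axis=-1).mean(axis=0)
    return entropy_of_mean - mean_entropy
```

Points whose sampled predictions disagree (high epistemic uncertainty) receive high scores; points where every sample agrees, even on an uncertain prediction, score near zero.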