Efficient nonmyopic batch active search

Authors: Shali Jiang, Gustavo Malkomes, Matthew Abbott, Benjamin Moseley, Roman Garnett

NeurIPS 2018

Reproducibility assessment (variable, result, and supporting evidence from the paper):

Research Type: Experimental
  "We conduct thorough experiments on data from three application domains: a citation network, material science, and drug discovery, testing all proposed policies with a wide range of batch sizes. Our results demonstrate that the empirical performance gap matches our theoretical bound, that nonmyopic policies usually significantly outperform myopic alternatives, and that diversity is an important consideration for batch policy design."

Researcher Affiliation: Collaboration
  Shali Jiang, CSE, WUSTL, St. Louis, MO 63130 (jiang.s@wustl.edu)
  Gustavo Malkomes, CSE, WUSTL, St. Louis, MO 63130 (luizgustavo@wustl.edu)
  Matthew Abbott, CSE, WUSTL, St. Louis, MO 63130 (mbabbott@wustl.edu)
  Benjamin Moseley, Tepper School of Business, CMU, and Relational AI, Pittsburgh, PA 15213 (moseleyb@andrew.cmu.edu)
  Roman Garnett, CSE, WUSTL, St. Louis, MO 63130 (garnett@wustl.edu)

Pseudocode: No
  The paper does not contain structured pseudocode or algorithm blocks.

Open Source Code: Yes
  "We implement all these policies with the MATLAB active learning toolbox." (https://github.com/rmgarnett/active_learning)

Open Datasets: Yes
  "We consider the first ten of the 120 datasets used in [7, 12] and only the ECFP4 fingerprint, which showed the best performance in those studies. These datasets share a pool of 100 000 negative compounds randomly selected from the ZINC database [20]."

Dataset Splits: No
  The paper describes its experimental setup in terms of budget and repetitions, but it does not specify explicit training, validation, and test splits in the conventional machine-learning sense.

Hardware Specification: No
  The paper does not report the hardware used to run its experiments (e.g., CPU or GPU models, memory, or cloud instances).

Software Dependencies: No
  The paper mentions the MATLAB active learning toolbox but does not give version numbers for MATLAB or the toolbox, which would be required for reproducible software dependencies.

Experiment Setup: Yes
  "We use k nearest neighbor (k-nn) with k = 100 as our probability model for the drug discovery datasets, and k = 50 for the other two datasets (following the studies in [7, 12]). For each dataset, we start with one random initial positive seed observation and repeat the experiment 20 times. ... The budget is set as T = 500. We test batch-ENS with 16 and 32 samples, coded as batch-ENS-16 and batch-ENS-32."
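The k-nn probability model quoted under Experiment Setup scores an unlabeled point by looking at its k nearest labeled neighbors. The paper's actual implementation lives in the MATLAB toolbox above and may include smoothing or prior terms; the Python sketch below is only an illustrative assumption (the function name `knn_probability`, the Euclidean metric, and the unsmoothed positive-fraction estimate are all choices made here, not taken from the paper):

```python
import math

def knn_probability(x, labeled, k=100):
    """Estimate P(positive | x) as the fraction of positives among the
    k nearest labeled points. Euclidean distance and no smoothing are
    illustrative simplifications, not the authors' exact model.

    labeled: list of (feature_vector, label) pairs, label in {0, 1}.
    """
    def dist(a, b):
        return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

    # Sort labeled points by distance to x and keep the k closest.
    neighbors = sorted(labeled, key=lambda pair: dist(x, pair[0]))[:k]
    return sum(label for _, label in neighbors) / len(neighbors)

# Toy usage: a 3-nn estimate at the origin over four labeled 2-D points.
data = [((0.0, 0.1), 1), ((0.1, 0.0), 1), ((0.2, 0.2), 0), ((5.0, 5.0), 0)]
print(knn_probability((0.0, 0.0), data, k=3))  # → 0.6666666666666666
```

With this kind of model, an active search policy would query the point maximizing (for a greedy policy) some utility of these probabilities over the unlabeled pool, with the paper's nonmyopic batch policies reasoning further ahead than the one-step greedy choice.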