Efficient nonmyopic batch active search
Authors: Shali Jiang, Gustavo Malkomes, Matthew Abbott, Benjamin Moseley, Roman Garnett
NeurIPS 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct thorough experiments on data from three application domains: a citation network, material science, and drug discovery, testing all proposed policies with a wide range of batch sizes. Our results demonstrate that the empirical performance gap matches our theoretical bound, that nonmyopic policies usually significantly outperform myopic alternatives, and that diversity is an important consideration for batch policy design. |
| Researcher Affiliation | Collaboration | Shali Jiang, CSE, WUSTL, St. Louis, MO 63130 (jiang.s@wustl.edu); Gustavo Malkomes, CSE, WUSTL, St. Louis, MO 63130 (luizgustavo@wustl.edu); Matthew Abbott, CSE, WUSTL, St. Louis, MO 63130 (mbabbott@wustl.edu); Benjamin Moseley, Tepper School of Business, CMU and Relational AI, Pittsburgh, PA 15213 (moseleyb@andrew.cmu.edu); Roman Garnett, CSE, WUSTL, St. Louis, MO 63130 (garnett@wustl.edu) |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | We implement all these policies with the MATLAB active learning toolbox: https://github.com/rmgarnett/active_learning |
| Open Datasets | Yes | We consider the first ten of the 120 datasets used in [7, 12] and only the ECFP4 fingerprint, which showed the best performance in those studies. These datasets share a pool of 100 000 negative compounds randomly selected from the ZINC database [20]. |
| Dataset Splits | No | The paper describes its experimental setup in terms of budget and repetitions, but it does not specify explicit training, validation, and test dataset splits in the conventional machine learning sense. |
| Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., CPU, GPU models, memory, or cloud instances) used for running the experiments. |
| Software Dependencies | No | The paper mentions using 'the MATLAB active learning toolbox' but does not specify version numbers for MATLAB or the toolbox, which are required for reproducible software dependencies. |
| Experiment Setup | Yes | We use k nearest neighbor (k-nn) with k = 100 as our probability model for the drug discovery datasets, and k = 50 for the other two datasets (following the studies in [7, 12]). For each dataset, we start with one random initial positive seed observation and repeat the experiment 20 times. ... The budget is set as T = 500. We test batch-ENS with 16 and 32 samples, coded as batch-ENS-16 and batch-ENS-32. |
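The setup row above describes a k nearest neighbor probability model (k = 100 for drug discovery, k = 50 for the other domains). As a rough illustration of what such a model computes, here is a minimal Python sketch that estimates the positive-class probability of a query point as the fraction of positive labels among its k nearest observed neighbors. This is an assumption for illustration only: the function name, Euclidean distance metric, and unsmoothed fraction are hypothetical, and the paper's actual k-nn model (implemented in the MATLAB toolbox) may use a different distance and a smoothed estimate.

```python
import numpy as np

def knn_positive_probability(X_obs, y_obs, X_query, k=100):
    """Estimate P(y = 1 | x) for each query point as the fraction of
    positive labels among its k nearest observed neighbors.

    X_obs:   (n, d) array of observed feature vectors
    y_obs:   (n,) array of binary labels (1 = positive, 0 = negative)
    X_query: (m, d) array of unlabeled points to score
    """
    probs = []
    for x in X_query:
        # Euclidean distances from the query to all observed points.
        dists = np.linalg.norm(X_obs - x, axis=1)
        # Indices of the k nearest observed neighbors.
        nearest = np.argsort(dists)[:k]
        # Fraction of positives among those neighbors.
        probs.append(y_obs[nearest].mean())
    return np.array(probs)
```

In an active search loop, these scores would then feed the acquisition policy (e.g., greedily querying the highest-probability point, or a nonmyopic batch policy as in the paper).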