Diversity Enhanced Active Learning with Strictly Proper Scoring Rules

Authors: Wei Tan, Lan Du, Wray Buntine

NeurIPS 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experimental evaluation then explores how these different acquisition functions perform. The results show that the use of mean square error and log probability with BEMPS yields robust acquisition functions, which consistently outperform the others tested.
Researcher Affiliation | Academia | Wei Tan, Monash University (wei.tan2@monash.edu); Lan Du, Monash University (lan.du@monash.edu); Wray Buntine, Monash University (wray.buntine@monash.edu)
Pseudocode | Yes | Algorithm 1: Estimating point-wise Q(x | L, x') with Equation (6); Algorithm 2: Estimating argmax_{x in U} Q(x | L); Algorithm 3: Finding a diverse batch. (An illustrative acquisition sketch follows the table.)
Open Source Code | Yes | Our implementation of BEMPS can be downloaded from https://github.com/davidtw999/BEMPS.
Open Datasets | Yes | We used four benchmark text datasets covering three classification tasks: topic classification, sentence classification, and sentiment analysis, as shown in Table 1. AG NEWS, for topic classification, contains 120K texts with four balanced classes [41]. PUBMED 20k was used for sentence classification [3] and contains about 20K medical abstracts with five categories. For sentiment analysis, we used both the SST-5 and IMDB datasets: SST-5 contains 11K sentences extracted from movie reviews with five imbalanced sentiment labels [33], and IMDB contains 50K movie reviews with two balanced classes [18]. (A dataset-loading sketch follows the table.)
Dataset Splits | Yes | Meanwhile, the initial training and validation splits contain only 20 and 6 samples, respectively.
Hardware Specification | Yes | All experiments were run on 8 Tesla 16GB V100 GPUs.
Software Dependencies | No | The paper mentions software such as DistilBERT and AdamW, but does not specify version numbers or other software dependencies with the version information required for reproducibility.
Experiment Setup | Yes | We fine-tuned DistilBERT on each dataset after each AL iteration with a random re-initialization [5]... The maximum sequence length was set to 128, and a maximum of 30 epochs was used when fine-tuning DistilBERT with early stopping [4]. We used AdamW [15] as the optimizer with learning rate 2e-5 and betas 0.9/0.999. Each AL method was run five times with different random seeds on each dataset. The batch size B was set to {1, 5, 10, 50, 100}. (A fine-tuning configuration sketch follows the table.)
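
To make the Pseudocode row more concrete, here is a minimal sketch of a strictly-proper-scoring-rule acquisition with a diversity-aware batch step. It is not the authors' exact Algorithms 1-3: the utility in `score_change` is only a Brier-style (mean squared error) proxy for the point-wise Q(x | L, x') estimate, the k-means step in `diverse_batch` is an assumed stand-in for Algorithm 3's batch construction, and all function names are hypothetical.

```python
# Minimal sketch only -- not the released BEMPS code. Assumes an ensemble of E
# posterior samples giving class probabilities for N pool points and C classes.
import numpy as np
from sklearn.cluster import KMeans

def score_change(probs):
    """Brier-style (mean squared error) utility per candidate.

    Measures how much ensemble members disagree with the ensemble mean under a
    squared-error score; used here as a rough proxy for the point-wise
    Q(x | L, x') that Algorithm 1 in the paper estimates.
    probs: array of shape (E, N, C).
    """
    mean_p = probs.mean(axis=0)                          # (N, C) ensemble mean
    gap = ((probs - mean_p[None, :, :]) ** 2).sum(-1)    # (E, N) squared-error gap
    return gap.mean(axis=0)                              # (N,) utility per pool point

def diverse_batch(probs, utilities, batch_size, seed=0):
    """Pick a batch that trades utility against diversity.

    Clusters the mean predictive distributions with k-means and keeps the
    highest-utility candidate from each cluster -- one simple diversification
    scheme; the paper's Algorithm 3 may construct the batch differently.
    """
    feats = probs.mean(axis=0)                           # (N, C) features per candidate
    labels = KMeans(n_clusters=batch_size, n_init=10, random_state=seed).fit_predict(feats)
    picks = []
    for k in range(batch_size):
        members = np.where(labels == k)[0]
        picks.append(members[np.argmax(utilities[members])])
    return np.array(picks)

# Toy usage: 5 ensemble members, 200 pool points, 4 classes.
rng = np.random.default_rng(0)
probs = rng.dirichlet(np.ones(4), size=(5, 200))         # shape (5, 200, 4)
batch = diverse_batch(probs, score_change(probs), batch_size=10)
print(batch)
```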
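
The Open Datasets and Dataset Splits rows can be reproduced in spirit with the Hugging Face `datasets` library. The sketch below is an assumption about how one might load two of the four corpora (`ag_news` and `imdb` are standard Hub datasets; the SST-5 and PUBMED 20k sources are not shown) and carve out the 20/6 initial training/validation split the paper reports; the exact sampling procedure in the released code may differ.

```python
# Hedged sketch: loading two of the four benchmark corpora from the Hugging Face
# Hub and drawing the small initial labelled split reported in the paper.
from datasets import load_dataset

ag_news = load_dataset("ag_news")   # 120K training texts, 4 balanced classes
imdb = load_dataset("imdb")         # 50K movie reviews, 2 balanced classes

# Initial pools: 20 labelled training examples and 6 validation examples;
# everything else remains in the unlabeled pool for acquisition.
pool = ag_news["train"].shuffle(seed=42)
initial_train = pool.select(range(20))
initial_valid = pool.select(range(20, 26))
unlabeled_pool = pool.select(range(26, len(pool)))
print(len(initial_train), len(initial_valid), len(unlabeled_pool))
```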
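
The Experiment Setup row maps naturally onto a Hugging Face `transformers` fine-tuning configuration. The sketch below mirrors the reported hyperparameters (DistilBERT, AdamW with learning rate 2e-5 and betas 0.9/0.999, maximum sequence length 128, up to 30 epochs with early stopping); it is an assumed Trainer-based setup, not the authors' released training loop, and the early-stopping patience is a guess.

```python
# Hedged sketch of the reported fine-tuning setup; the released BEMPS code may
# wire this up differently (e.g. a custom training loop instead of Trainer).
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          EarlyStoppingCallback, Trainer, TrainingArguments)

pool = load_dataset("ag_news")["train"].shuffle(seed=42)
initial_train, initial_valid = pool.select(range(20)), pool.select(range(20, 26))

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=4)             # 4 classes for AG NEWS

def tokenize(batch):
    # Maximum sequence length of 128, as reported in the paper.
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

args = TrainingArguments(
    output_dir="distilbert-al",
    learning_rate=2e-5,                   # AdamW lr 2e-5; betas 0.9/0.999 are the defaults
    num_train_epochs=30,                  # at most 30 epochs ...
    evaluation_strategy="epoch",          # ... with early stopping on the validation set
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    greater_is_better=False,
    seed=42,                              # repeated with different seeds across AL runs
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=initial_train.map(tokenize, batched=True),
    eval_dataset=initial_valid.map(tokenize, batched=True),
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],  # patience is an assumption
)
trainer.train()
```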