Diversity Enhanced Active Learning with Strictly Proper Scoring Rules

Authors: Wei Tan, Lan Du, Wray Buntine

NeurIPS 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experimental evaluation then explores how these different acquisition functions perform. The results show that the use of mean square error and log probability with BEMPS yields robust acquisition functions, which consistently outperform the others tested.
Researcher Affiliation | Academia | Wei Tan, Monash University (wei.tan2@monash.edu); Lan Du, Monash University (lan.du@monash.edu); Wray Buntine, Monash University (wray.buntine@monash.edu)
Pseudocode | Yes | Algorithm 1: Estimating point-wise Q(x | L, x') with Equation (6); Algorithm 2: Estimating argmax_{x in U} Q(x | L); Algorithm 3: Finding a diverse batch. (An illustrative acquisition sketch follows the table.)
Open Source Code | Yes | Our implementation of BEMPS can be downloaded from https://github.com/davidtw999/BEMPS.
Open Datasets | Yes | We used four benchmark text datasets covering three classification tasks: topic classification, sentence classification, and sentiment analysis, as shown in Table 1. AG NEWS, for topic classification, contains 120K texts with four balanced classes [41]. PUBMED 20k was used for sentence classification [3] and contains about 20K medical abstracts with five categories. For sentiment analysis, we used both the SST-5 and IMDB datasets: SST-5 contains 11K sentences extracted from movie reviews with five imbalanced sentiment labels [33], and IMDB contains 50K movie reviews with two balanced classes [18]. (A dataset-loading sketch follows the table.)
Dataset Splits | Yes | Meanwhile, the initial training and validation splits contain only 20 and 6 samples, respectively.
Hardware Specification | Yes | All experiments were run on 8 Tesla 16GB V100 GPUs.
Software Dependencies | No | The paper mentions software such as DistilBERT and AdamW, but does not specify version numbers or other software dependencies with the version information required for reproducibility.
Experiment Setup | Yes | We fine-tuned DistilBERT on each dataset after each AL iteration with a random re-initialization [5]... The maximum sequence length was set to 128, and a maximum of 30 epochs was used when fine-tuning DistilBERT with early stopping [4]. We used AdamW [15] as the optimizer with learning rate 2e-5 and betas 0.9/0.999. Each AL method was run five times with different random seeds on each dataset. The batch size B was set to {1, 5, 10, 50, 100}. (A fine-tuning configuration sketch follows the table.)
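
To make the Pseudocode row more concrete, here is a minimal sketch of a strictly-proper-scoring-rule acquisition with a diversity-aware batch step. It is not the authors' exact Algorithms 1-3: the utility in `score_change` is only a Brier-style (mean squared error) proxy for the point-wise Q(x | L, x') estimate, the k-means step in `diverse_batch` is an assumed stand-in for Algorithm 3's batch construction, and all function names are hypothetical.

```python
# Minimal sketch only -- not the released BEMPS code. Assumes an ensemble of E
# posterior samples giving class probabilities for N pool points and C classes.
import numpy as np
from sklearn.cluster import KMeans

def score_change(probs):
    """Brier-style (mean squared error) utility per candidate.

    Measures how much ensemble members disagree with the ensemble mean under a
    squared-error score; used here as a rough proxy for the point-wise
    Q(x | L, x') that Algorithm 1 in the paper estimates.
    probs: array of shape (E, N, C).
    """
    mean_p = probs.mean(axis=0)                          # (N, C) ensemble mean
    gap = ((probs - mean_p[None, :, :]) ** 2).sum(-1)    # (E, N) squared-error gap
    return gap.mean(axis=0)                              # (N,) utility per pool point

def diverse_batch(probs, utilities, batch_size, seed=0):
    """Pick a batch that trades utility against diversity.

    Clusters the mean predictive distributions with k-means and keeps the
    highest-utility candidate from each cluster -- one simple diversification
    scheme; the paper's Algorithm 3 may construct the batch differently.
    """
    feats = probs.mean(axis=0)                           # (N, C) features per candidate
    labels = KMeans(n_clusters=batch_size, n_init=10, random_state=seed).fit_predict(feats)
    picks = []
    for k in range(batch_size):
        members = np.where(labels == k)[0]
        picks.append(members[np.argmax(utilities[members])])
    return np.array(picks)

# Toy usage: 5 ensemble members, 200 pool points, 4 classes.
rng = np.random.default_rng(0)
probs = rng.dirichlet(np.ones(4), size=(5, 200))         # shape (5, 200, 4)
batch = diverse_batch(probs, score_change(probs), batch_size=10)
print(batch)
```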
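
The Open Datasets and Dataset Splits rows can be reproduced in spirit with the Hugging Face `datasets` library. The sketch below is an assumption about how one might load two of the four corpora (`ag_news` and `imdb` are standard Hub datasets; the SST-5 and PUBMED 20k sources are not shown) and carve out the 20/6 initial training/validation split the paper reports; the exact sampling procedure in the released code may differ.

```python
# Hedged sketch: loading two of the four benchmark corpora from the Hugging Face
# Hub and drawing the small initial labelled split reported in the paper.
from datasets import load_dataset

ag_news = load_dataset("ag_news")   # 120K training texts, 4 balanced classes
imdb = load_dataset("imdb")         # 50K movie reviews, 2 balanced classes

# Initial pools: 20 labelled training examples and 6 validation examples;
# everything else remains in the unlabeled pool for acquisition.
pool = ag_news["train"].shuffle(seed=42)
initial_train = pool.select(range(20))
initial_valid = pool.select(range(20, 26))
unlabeled_pool = pool.select(range(26, len(pool)))
print(len(initial_train), len(initial_valid), len(unlabeled_pool))
```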
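
The Experiment Setup row maps naturally onto a Hugging Face `transformers` fine-tuning configuration. The sketch below mirrors the reported hyperparameters (DistilBERT, AdamW with learning rate 2e-5 and betas 0.9/0.999, maximum sequence length 128, up to 30 epochs with early stopping); it is an assumed Trainer-based setup, not the authors' released training loop, and the early-stopping patience is a guess.

```python
# Hedged sketch of the reported fine-tuning setup; the released BEMPS code may
# wire this up differently (e.g. a custom training loop instead of Trainer).
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          EarlyStoppingCallback, Trainer, TrainingArguments)

pool = load_dataset("ag_news")["train"].shuffle(seed=42)
initial_train, initial_valid = pool.select(range(20)), pool.select(range(20, 26))

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=4)             # 4 classes for AG NEWS

def tokenize(batch):
    # Maximum sequence length of 128, as reported in the paper.
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

args = TrainingArguments(
    output_dir="distilbert-al",
    learning_rate=2e-5,                   # AdamW lr 2e-5; betas 0.9/0.999 are the defaults
    num_train_epochs=30,                  # at most 30 epochs ...
    evaluation_strategy="epoch",          # ... with early stopping on the validation set
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    greater_is_better=False,
    seed=42,                              # repeated with different seeds across AL runs
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=initial_train.map(tokenize, batched=True),
    eval_dataset=initial_valid.map(tokenize, batched=True),
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],  # patience is an assumption
)
trainer.train()
```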