Interactive Weak Supervision: Learning Useful Heuristics for Data Labeling
Authors: Benedikt Boecking, Willie Neiswanger, Eric Xing, Artur Dubrawski
ICLR 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments demonstrate that only a small number of feedback iterations are needed to train models that achieve highly competitive test set performance without access to ground truth training labels. We conduct user studies, which show that users are able to effectively provide feedback on heuristics and that test set results track the performance of simulated oracles. |
| Researcher Affiliation | Academia | Benedikt Boecking (Carnegie Mellon University), Willie Neiswanger (Stanford University), Eric P. Xing (Carnegie Mellon University), Artur Dubrawski (Carnegie Mellon University). Emails: {boecking,epxing,awd}@cs.cmu.edu; neiswanger@cs.stanford.edu |
| Pseudocode | Yes | Algorithm 1: Interactive Weak Supervision (IWS-LSE-a). Input: L (set of LFs), T (max iterations). For t = 1, 2, ..., T: λ_t ← argmax_{λ ∈ L \ Q_{t−1}} ϕ_t(λ) (Eq. 4); u_t ← ExpertQuery(λ_t); Q_t ← Q_{t−1} ∪ {(λ_t, u_t)}. Output: L̂ ← {λ_j ∈ L : E[p(α_j \| Q_T)] > r} (Eq. 5). (A Python sketch of this loop appears after the table.) |
| Open Source Code | Yes | Code is available at https://github.com/benbo/interactive-weak-supervision |
| Open Datasets | Yes | Datasets For our text data experiments, we use three publicly available datasets to define six binary text classification tasks. We use a subset of the Amazon Review Data (He & McAuley, 2016) for sentiment classification... We use the IMDB Movie Review Sentiment dataset (Maas et al., 2011)... In addition, we use the Bias in Bios (De-Arteaga et al., 2019) dataset... For the cross-modal tasks of text captions and images as well as the pure image task we use the COCO dataset (Lin et al., 2014). |
| Dataset Splits | Yes | We use a subset of the Amazon Review Data (He & McAuley, 2016) for sentiment classification, aggregating all categories with more than 100k reviews, from which we sample 200k reviews and split them into 160k training points and 40k test points (a split sketch appears after the table). We use the IMDB Movie Review Sentiment dataset (Maas et al., 2011), which has 25k training samples and 25k test samples. In addition, we use the Bias in Bios (De-Arteaga et al., 2019) dataset, from which we create binary classification tasks to distinguish difficult pairs among frequently occurring occupations. Specifically, we create the following subsets with equally sized train and test sets: journalist or photographer (n = 32,258), professor or teacher (n = 24,588), painter or architect (n = 12,236), professor or physician (n = 54,476). |
| Hardware Specification | No | No specific hardware details (e.g., exact GPU/CPU models, memory amounts, or detailed computer specifications) used for running experiments were provided. |
| Software Dependencies | No | No specific software versions (e.g., library or solver names with version numbers) needed to replicate the experiment were provided. The paper mentions 'ReLU activations' and optimization using 'Adam (Kingma & Ba, 2014)', but no specific software versions. |
| Experiment Setup | Yes | Our probabilistic ensemble in IWS, which is used in all acquisition functions to learn p(u_j = 1 \| Q_t), is a bagging ensemble of s = 50 multilayer perceptrons with two hidden layers of size 10, ReLU activations, sigmoid output, and logarithmic loss. Our downstream end classifier f is a multilayer perceptron with two hidden layers of size 20, ReLU activations, sigmoid output, and logarithmic loss. Each model in the ensemble, as well as f, is optimized using Adam (Kingma & Ba, 2014). The first 8 iterations of IWS are initialized with queries of four LFs known to have accuracy between 0.7 and 0.75, drawn at random, and four randomly drawn LFs with arbitrary accuracy. Subsequently, IWS chooses the next LFs to query. (A PyTorch sketch of these architectures appears after the table.) |
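The Pseudocode row references the following minimal Python sketch of the Algorithm 1 loop. The helpers `acquisition_score`, `expert_query`, and `estimate_accuracy` are hypothetical stand-ins for the paper's acquisition function ϕ_t (Eq. 4), the human expert, and the posterior accuracy estimate E[p(α_j | Q_T)] (Eq. 5); it illustrates the loop structure under those assumptions, not the authors' implementation.

```python
def interactive_weak_supervision(lfs, acquisition_score, expert_query,
                                 estimate_accuracy, T, r):
    """Sketch of Algorithm 1 (IWS-LSE-a): query an expert about candidate
    labeling functions (LFs) for T iterations, then keep the LFs whose
    estimated accuracy exceeds the threshold r."""
    queried = {}  # Q_t: maps each queried LF to the expert's feedback u_t
    for t in range(T):
        # Pick the unqueried LF maximizing the acquisition function phi_t,
        # which is refit on the feedback gathered so far (Eq. 4).
        candidates = [lf for lf in lfs if lf not in queried]
        lf_t = max(candidates, key=lambda lf: acquisition_score(lf, queried))
        # Ask the expert whether lf_t looks useful (u_t in {0, 1}).
        queried[lf_t] = expert_query(lf_t)
    # Keep every LF whose posterior mean accuracy estimate, given the full
    # query history Q_T, exceeds r (Eq. 5).
    return [lf for lf in lfs if estimate_accuracy(lf, queried) > r]
```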
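The Dataset Splits row's Amazon partition (200k sampled reviews into 160k train / 40k test) corresponds to a standard 80/20 random split. A minimal scikit-learn sketch with placeholder arrays standing in for the real reviews; the paper does not state the tooling or random seed, so both are assumptions:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholders standing in for the 200k sampled Amazon reviews and their
# binary sentiment labels.
reviews = np.arange(200_000)
labels = np.random.randint(0, 2, size=200_000)

# An 80/20 split reproduces the 160k train / 40k test partition.
X_train, X_test, y_train, y_test = train_test_split(
    reviews, labels, test_size=0.2, random_state=0
)
print(len(X_train), len(X_test))  # 160000 40000
```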
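The Experiment Setup row pins down the architectures well enough to sketch. Below is a minimal PyTorch rendering; PyTorch itself and the input dimension are assumptions, since the paper names neither the framework nor the feature size, but the two-hidden-layer MLPs, ensemble size s = 50, binary cross-entropy ("logarithmic") loss, and Adam optimizers follow the quoted description.

```python
import torch
import torch.nn as nn

def make_mlp(in_dim: int, hidden: int) -> nn.Sequential:
    """Two-hidden-layer MLP with ReLU activations and a sigmoid output:
    hidden size 10 for ensemble members, 20 for the end classifier f."""
    return nn.Sequential(
        nn.Linear(in_dim, hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, 1), nn.Sigmoid(),
    )

in_dim = 300  # assumed feature dimension; not specified in the quoted setup

# Bagging ensemble of s = 50 small MLPs used to learn p(u_j = 1 | Q_t).
ensemble = [make_mlp(in_dim, hidden=10) for _ in range(50)]

# Downstream end classifier f.
end_classifier = make_mlp(in_dim, hidden=20)

# Logarithmic (binary cross-entropy) loss; each model gets its own Adam optimizer.
loss_fn = nn.BCELoss()
optimizers = [torch.optim.Adam(m.parameters()) for m in ensemble + [end_classifier]]
```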