Interactive Weak Supervision: Learning Useful Heuristics for Data Labeling
Authors: Benedikt Boecking, Willie Neiswanger, Eric Xing, Artur Dubrawski
ICLR 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments demonstrate that only a small number of feedback iterations are needed to train models that achieve highly competitive test set performance without access to ground truth training labels. We conduct user studies, which show that users are able to effectively provide feedback on heuristics and that test set results track the performance of simulated oracles. |
| Researcher Affiliation | Academia | Benedikt Boecking (Carnegie Mellon University), Willie Neiswanger (Stanford University), Eric P. Xing (Carnegie Mellon University), Artur Dubrawski (Carnegie Mellon University). Emails: {boecking,epxing,awd}@cs.cmu.edu; neiswanger@cs.stanford.edu |
| Pseudocode | Yes | Algorithm 1: Interactive Weak Supervision (IWS-LSE-a). Input: L (set of LFs), T (max iterations). For t = 1, 2, ..., T: λ_t ← argmax_{λ ∈ L \ Q_{t−1}} ϕ_t(λ) (Eq. 4); u_t ← ExpertQuery(λ_t); Q_t ← Q_{t−1} ∪ {(λ_t, u_t)}. Output: L̂ ← {λ_j ∈ L : E[p(α_j \| Q_T)] > r} (Eq. 5). (A Python sketch of this loop appears after the table.) |
| Open Source Code | Yes | Code is available at https://github.com/benbo/interactive-weak-supervision |
| Open Datasets | Yes | Datasets For our text data experiments, we use three publicly available datasets to define six binary text classification tasks. We use a subset of the Amazon Review Data (He & McAuley, 2016) for sentiment classification... We use the IMDB Movie Review Sentiment dataset (Maas et al., 2011)... In addition, we use the Bias in Bios (De-Arteaga et al., 2019) dataset... For the cross-modal tasks of text captions and images as well as the pure image task we use the COCO dataset (Lin et al., 2014). |
| Dataset Splits | Yes | We use a subset of the Amazon Review Data (He & McAuley, 2016) for sentiment classification, aggregating all categories with more than 100k reviews, from which we sample 200k reviews and split them into 160k training points and 40k test points (a split sketch appears after the table). We use the IMDB Movie Review Sentiment dataset (Maas et al., 2011), which has 25k training samples and 25k test samples. In addition, we use the Bias in Bios (De-Arteaga et al., 2019) dataset, from which we create binary classification tasks to distinguish difficult pairs among frequently occurring occupations. Specifically, we create the following subsets with equally sized train and test sets: journalist or photographer (n = 32,258), professor or teacher (n = 24,588), painter or architect (n = 12,236), professor or physician (n = 54,476). |
| Hardware Specification | No | No specific hardware details (e.g., exact GPU/CPU models, memory amounts, or detailed computer specifications) used for running experiments were provided. |
| Software Dependencies | No | No specific software versions (e.g., library or solver names with version numbers) needed to replicate the experiment were provided. The paper mentions 'ReLU activations' and optimization using 'Adam (Kingma & Ba, 2014)', but no specific software versions. |
| Experiment Setup | Yes | Our probabilistic ensemble in IWS, which is used in all acquisition functions to learn p(u_j = 1 \| Q_t), is a bagging ensemble of s = 50 multilayer perceptrons with two hidden layers of size 10, ReLU activations, sigmoid output, and logarithmic loss. Our downstream end classifier f is a multilayer perceptron with two hidden layers of size 20, ReLU activations, sigmoid output, and logarithmic loss. Each model in the ensemble, as well as f, is optimized using Adam (Kingma & Ba, 2014). The first 8 iterations of IWS are initialized with queries of four LFs known to have accuracy between 0.7 and 0.75, drawn at random, and four randomly drawn LFs with arbitrary accuracy. Subsequently, IWS chooses the next LFs to query. (A PyTorch sketch of these architectures appears after the table.) |
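The Pseudocode row references the following minimal Python sketch of the Algorithm 1 loop. The helpers `acquisition_score`, `expert_query`, and `estimate_accuracy` are hypothetical stand-ins for the paper's acquisition function ϕ_t (Eq. 4), the human expert, and the posterior accuracy estimate E[p(α_j | Q_T)] (Eq. 5); it illustrates the loop structure under those assumptions, not the authors' implementation.

```python
def interactive_weak_supervision(lfs, acquisition_score, expert_query,
                                 estimate_accuracy, T, r):
    """Sketch of Algorithm 1 (IWS-LSE-a): query an expert about candidate
    labeling functions (LFs) for T iterations, then keep the LFs whose
    estimated accuracy exceeds the threshold r."""
    queried = {}  # Q_t: maps each queried LF to the expert's feedback u_t
    for t in range(T):
        # Pick the unqueried LF maximizing the acquisition function phi_t,
        # which is refit on the feedback gathered so far (Eq. 4).
        candidates = [lf for lf in lfs if lf not in queried]
        lf_t = max(candidates, key=lambda lf: acquisition_score(lf, queried))
        # Ask the expert whether lf_t looks useful (u_t in {0, 1}).
        queried[lf_t] = expert_query(lf_t)
    # Keep every LF whose posterior mean accuracy estimate, given the full
    # query history Q_T, exceeds r (Eq. 5).
    return [lf for lf in lfs if estimate_accuracy(lf, queried) > r]
```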
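The Dataset Splits row's Amazon partition (200k sampled reviews into 160k train / 40k test) corresponds to a standard 80/20 random split. A minimal scikit-learn sketch with placeholder arrays standing in for the real reviews; the paper does not state the tooling or random seed, so both are assumptions:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholders standing in for the 200k sampled Amazon reviews and their
# binary sentiment labels.
reviews = np.arange(200_000)
labels = np.random.randint(0, 2, size=200_000)

# An 80/20 split reproduces the 160k train / 40k test partition.
X_train, X_test, y_train, y_test = train_test_split(
    reviews, labels, test_size=0.2, random_state=0
)
print(len(X_train), len(X_test))  # 160000 40000
```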
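The Experiment Setup row pins down the architectures well enough to sketch. Below is a minimal PyTorch rendering; PyTorch itself and the input dimension are assumptions, since the paper names neither the framework nor the feature size, but the two-hidden-layer MLPs, ensemble size s = 50, binary cross-entropy ("logarithmic") loss, and Adam optimizers follow the quoted description.

```python
import torch
import torch.nn as nn

def make_mlp(in_dim: int, hidden: int) -> nn.Sequential:
    """Two-hidden-layer MLP with ReLU activations and a sigmoid output:
    hidden size 10 for ensemble members, 20 for the end classifier f."""
    return nn.Sequential(
        nn.Linear(in_dim, hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, 1), nn.Sigmoid(),
    )

in_dim = 300  # assumed feature dimension; not specified in the quoted setup

# Bagging ensemble of s = 50 small MLPs used to learn p(u_j = 1 | Q_t).
ensemble = [make_mlp(in_dim, hidden=10) for _ in range(50)]

# Downstream end classifier f.
end_classifier = make_mlp(in_dim, hidden=20)

# Logarithmic (binary cross-entropy) loss; each model gets its own Adam optimizer.
loss_fn = nn.BCELoss()
optimizers = [torch.optim.Adam(m.parameters()) for m in ensemble + [end_classifier]]
```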