Training Subset Selection for Weak Supervision

Authors: Hunter Lang, Aravindan Vijayaraghavan, David Sontag

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We present numerical experiments demonstrating that the status quo of using all the pseudolabeled data is nearly always suboptimal. Combining good pretrained representations with the cut statistic [23] for subset selection, we obtain subsets of the weakly-labeled training data where the weak labels are very accurate. ... Our empirical study shows that this combination is very effective at selecting good pseudolabeled training data across a wide variety of label models, end models, and datasets. We evaluate our approach on the WRENCH benchmark [42] for weak supervision. We compare the status quo of full coverage (β = 1.0) to β chosen from {0.1, 0.2, . . . , 1.0}. We evaluate our approach with five different label models: Majority Vote (MV), the original Snorkel/Data Programming (DP) [30], Dawid-Skene (DS) [8], FlyingSquid (FS) [10], and MeTaL [29].
Researcher Affiliation | Academia | Hunter Lang, MIT CSAIL, hjl@mit.edu; Aravindan Vijayaraghavan, Northwestern University, aravindv@northwestern.edu; David Sontag, MIT CSAIL, dsontag@mit.edu
Pseudocode | No | The paper mentions providing code in Appendix C but does not include any pseudocode or a formally labeled algorithm block in the main text.
Open Source Code | Yes | We include the code for reproducing our empirical results in the supplementary material.
Open Datasets | Yes | We evaluate our approach on the WRENCH benchmark [42] for weak supervision. ... Full details for the datasets and the weak label sources are available in [42], Table 5, and reproduced here in Appendix B.1.
Dataset Splits | Yes | Hyperparameter tuning. Our subset selection approach introduces a new hyperparameter, β, the fraction of covered data to retain for training the classifier. ... choosing the value with the best (ground-truth) validation performance. ... The average validation set size of the WRENCH datasets from Table 1 is over 2,500 examples. ... We compare choosing the best model checkpoint and picking the best coverage fraction β using (i) the full validation set and (ii) a randomly-sampled validation set of 100 examples.
Hardware Specification | Yes | We performed all model training on NVIDIA A100 GPUs.
Software Dependencies | No | The paper mentions using 'pretrained roberta-base and bert-base-cased' models and downloading weights from 'huggingface.co/datasets', but does not provide specific version numbers for any software dependencies such as programming languages, libraries, or frameworks.
Experiment Setup | Yes | To keep the hyperparameter tuning burden low, we first tune all other hyperparameters identically to Zhang et al. [42], holding β fixed at 1.0. We then use the optimal hyperparameters (learning rate, batch size, weight decay, etc.) from β = 1.0 for a grid search over values of β ∈ {0.1, 0.2, . . . , 1.0}... In all of our experiments, we used K = 20 nearest neighbors to compute the cut statistic and performed no tuning on this value.
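
The excerpts above describe ranking covered examples by the cut statistic computed over a K-nearest-neighbor graph of pretrained representations and keeping the β fraction with the most reliable pseudolabels. The sketch below is an illustration of that idea only, not the code from the paper's supplementary material; the function name `select_by_cut_statistic` and the z-score normalization details are assumptions based on the cited cut statistic [23].

```python
# Hypothetical sketch of cut-statistic subset selection (not the authors' released code).
# Assumes `embeddings` are pretrained representations (n x d) and `weak_labels` are
# integer pseudolabels from a label model; K = 20 and beta follow the paper's setup.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def select_by_cut_statistic(embeddings, weak_labels, beta=0.5, K=20):
    n = len(weak_labels)
    nn = NearestNeighbors(n_neighbors=K + 1).fit(embeddings)
    _, idx = nn.kneighbors(embeddings)
    neighbors = idx[:, 1:]                        # drop each point itself

    class_freq = np.bincount(weak_labels) / n     # empirical pseudolabel frequencies
    scores = np.empty(n)
    for i in range(n):
        disagree = np.sum(weak_labels[neighbors[i]] != weak_labels[i])  # "cut" edges
        p = 1.0 - class_freq[weak_labels[i]]      # chance a random neighbor disagrees
        mu, sigma = K * p, np.sqrt(K * p * (1.0 - p)) + 1e-12
        scores[i] = (disagree - mu) / sigma       # z-score form of the cut statistic

    # Keep the beta fraction of examples whose neighborhoods agree most with
    # their own pseudolabel (lowest cut statistic).
    return np.argsort(scores)[: int(beta * n)]
```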
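The tuning procedure quoted above (fix all other hyperparameters at β = 1.0, then grid-search β by ground-truth validation performance) could be sketched as follows. `train_end_model` and `evaluate` are hypothetical stand-ins, e.g., fine-tuning a RoBERTa classifier on the selected subset and scoring it on the validation split; they are not functions from the paper's code.

```python
# Hypothetical beta grid search; `train_end_model` and `evaluate` are illustrative
# stand-ins. Selection uses validation performance, as described in the excerpt above.
def tune_beta(train_embeddings, train_texts, weak_labels, val_texts, val_labels):
    best_beta, best_score, best_model = None, float("-inf"), None
    for beta in [round(0.1 * b, 1) for b in range(1, 11)]:   # beta in {0.1, ..., 1.0}
        keep = select_by_cut_statistic(train_embeddings, weak_labels, beta=beta, K=20)
        model = train_end_model(train_texts[keep], weak_labels[keep])
        score = evaluate(model, val_texts, val_labels)
        if score > best_score:
            best_beta, best_score, best_model = beta, score, model
    return best_beta, best_model
```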