Training Subset Selection for Weak Supervision

Authors: Hunter Lang, Aravindan Vijayaraghavan, David Sontag

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We present numerical experiments demonstrating that the status quo of using all the pseudolabeled data is nearly always suboptimal. Combining good pretrained representations with the cut statistic [23] for subset selection, we obtain subsets of the weakly-labeled training data where the weak labels are very accurate. ... Our empirical study shows that this combination is very effective at selecting good pseudolabeled training data across a wide variety of label models, end models, and datasets. We evaluate our approach on the WRENCH benchmark [42] for weak supervision. We compare the status quo of full coverage (β = 1.0) to β chosen from {0.1, 0.2, . . . , 1.0}. We evaluate our approach with five different label models: Majority Vote (MV), the original Snorkel/Data Programming (DP) [30], Dawid-Skene (DS) [8], FlyingSquid (FS) [10], and MeTaL [29].
Researcher Affiliation | Academia | Hunter Lang, MIT CSAIL, hjl@mit.edu; Aravindan Vijayaraghavan, Northwestern University, aravindv@northwestern.edu; David Sontag, MIT CSAIL, dsontag@mit.edu
Pseudocode | No | The paper mentions providing code in Appendix C but does not include any pseudocode or a formally labeled algorithm block in the main text.
Open Source Code | Yes | We include the code for reproducing our empirical results in the supplementary material.
Open Datasets | Yes | We evaluate our approach on the WRENCH benchmark [42] for weak supervision. ... Full details for the datasets and the weak label sources are available in [42], Table 5, and reproduced here in Appendix B.1.
Dataset Splits | Yes | Hyperparameter tuning. Our subset selection approach introduces a new hyperparameter, β, the fraction of covered data to retain for training the classifier. ... choosing the value with the best (ground-truth) validation performance. ... The average validation set size of the WRENCH datasets from Table 1 is over 2,500 examples. ... We compare choosing the best model checkpoint and picking the best coverage fraction β using (i) the full validation set and (ii) a randomly-sampled validation set of 100 examples.
Hardware Specification | Yes | We performed all model training on NVIDIA A100 GPUs.
Software Dependencies | No | The paper mentions using 'pretrained roberta-base and bert-base-cased' models and downloading weights from 'huggingface.co/datasets', but does not provide specific version numbers for any software dependencies such as programming languages, libraries, or frameworks.
Experiment Setup | Yes | To keep the hyperparameter tuning burden low, we first tune all other hyperparameters identically to Zhang et al. [42], holding β fixed at 1.0. We then use the optimal hyperparameters (learning rate, batch size, weight decay, etc.) from β = 1.0 for a grid search over values of β ∈ {0.1, 0.2, . . . , 1.0}... In all of our experiments, we used K = 20 nearest neighbors to compute the cut statistic and performed no tuning on this value.
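
The excerpts above describe ranking covered examples by the cut statistic computed over a K-nearest-neighbor graph of pretrained representations and keeping the β fraction with the most reliable pseudolabels. The sketch below is an illustration of that idea only, not the code from the paper's supplementary material; the function name `select_by_cut_statistic` and the z-score normalization details are assumptions based on the cited cut statistic [23].

```python
# Hypothetical sketch of cut-statistic subset selection (not the authors' released code).
# Assumes `embeddings` are pretrained representations (n x d) and `weak_labels` are
# integer pseudolabels from a label model; K = 20 and beta follow the paper's setup.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def select_by_cut_statistic(embeddings, weak_labels, beta=0.5, K=20):
    n = len(weak_labels)
    nn = NearestNeighbors(n_neighbors=K + 1).fit(embeddings)
    _, idx = nn.kneighbors(embeddings)
    neighbors = idx[:, 1:]                        # drop each point itself

    class_freq = np.bincount(weak_labels) / n     # empirical pseudolabel frequencies
    scores = np.empty(n)
    for i in range(n):
        disagree = np.sum(weak_labels[neighbors[i]] != weak_labels[i])  # "cut" edges
        p = 1.0 - class_freq[weak_labels[i]]      # chance a random neighbor disagrees
        mu, sigma = K * p, np.sqrt(K * p * (1.0 - p)) + 1e-12
        scores[i] = (disagree - mu) / sigma       # z-score form of the cut statistic

    # Keep the beta fraction of examples whose neighborhoods agree most with
    # their own pseudolabel (lowest cut statistic).
    return np.argsort(scores)[: int(beta * n)]
```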
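The tuning procedure quoted above (fix all other hyperparameters at β = 1.0, then grid-search β by ground-truth validation performance) could be sketched as follows. `train_end_model` and `evaluate` are hypothetical stand-ins, e.g., fine-tuning a RoBERTa classifier on the selected subset and scoring it on the validation split; they are not functions from the paper's code.

```python
# Hypothetical beta grid search; `train_end_model` and `evaluate` are illustrative
# stand-ins. Selection uses validation performance, as described in the excerpt above.
def tune_beta(train_embeddings, train_texts, weak_labels, val_texts, val_labels):
    best_beta, best_score, best_model = None, float("-inf"), None
    for beta in [round(0.1 * b, 1) for b in range(1, 11)]:   # beta in {0.1, ..., 1.0}
        keep = select_by_cut_statistic(train_embeddings, weak_labels, beta=beta, K=20)
        model = train_end_model(train_texts[keep], weak_labels[keep])
        score = evaluate(model, val_texts, val_labels)
        if score > best_score:
            best_beta, best_score, best_model = beta, score, model
    return best_beta, best_model
```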