Co-training Improves Prompt-based Learning for Large Language Models
Authors: Hunter Lang, Monica N Agrawal, Yoon Kim, David Sontag
ICML 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate that co-training (Blum & Mitchell, 1998) can improve the performance of prompt-based learning by using unlabeled data. We find that models trained in this manner can significantly improve performance on challenging datasets where there is currently a large gap between prompt-based learning and fully-supervised models. |
| Researcher Affiliation | Academia | Hunter Lang, Monica Agrawal, Yoon Kim, David Sontag (MIT CSAIL). Correspondence to: <hjl@mit.edu>. |
| Pseudocode | Yes | The skeleton of our approach is shown in Algorithm 1 (full detail is provided in Algorithms 4 and 5 in the supplement). |
| Open Source Code | Yes | Our code is publicly available at https://github.com/clinicalml/cotrain-prompting |
| Open Datasets | Yes | We use the RTE (Dagan et al., 2005), CB (De Marneffe et al., 2019), TREC (Voorhees & Tice, 2000), and BoolQ (Clark et al., 2019) datasets. Full details for these datasets are in Appendix B. In the partial access setting, we do not evaluate on BoolQ due to the large amount of GPT-3 quota required for labeling. In the full access setting, we do not evaluate on TREC as T0 was pretrained on TREC. |
| Dataset Splits | Yes | This validation set was sampled uniformly from the training set to give a training/validation split of 90%/10%. |
| Hardware Specification | Yes | All models were trained on two NVIDIA A100 80GB GPUs using PyTorch and the Transformers library (Wolf et al., 2020). |
| Software Dependencies | No | The paper mentions 'PyTorch and the Transformers library (Wolf et al., 2020)' and 'DeBERTa-large (microsoft/deberta-large)' but does not provide specific version numbers for these software components. |
| Experiment Setup | Yes | In each co-training iteration, we train the label model over view ϕ0 using Adam with learning rate 1e-4, weight decay 5e-3, and batch size 64 for 40 epochs. We fine-tune the last layer and pooler of DeBERTa-large over ϕ1 for 20 epochs using Adam with learning rate 1e-5, weight decay 0.01, and batch size 16. |
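
The pseudocode row above refers to the co-training skeleton in Algorithm 1. The snippet below is a minimal, self-contained sketch of the generic Blum & Mitchell (1998) co-training loop that the paper builds on, using logistic regression on synthetic two-view data purely for illustration. In the paper, view ϕ0 is a prompt-based representation trained with a label model and view ϕ1 is a DeBERTa representation with a fine-tuned head, so this is not the authors' implementation.

```python
# Generic co-training loop (Blum & Mitchell, 1998) on synthetic data.
# Both views are stand-ins: in the paper, view phi0 is a prompt-based
# representation (label model) and view phi1 is a DeBERTa representation.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, d = 1000, 20
y_true = rng.integers(0, 2, size=n)
# Two noisy "views" of the same underlying label.
view0 = y_true[:, None] + rng.normal(scale=2.0, size=(n, d))
view1 = y_true[:, None] + rng.normal(scale=2.0, size=(n, d))

# Small labeled seed; the rest is treated as unlabeled.
labeled = np.arange(50)
pseudo = dict(zip(labeled.tolist(), y_true[labeled].tolist()))

def fit_and_label(view, pseudo, k=100):
    """Fit on the current pseudo-labeled set; return the k most confident new labels."""
    idx = np.array(sorted(pseudo))
    clf = LogisticRegression(max_iter=1000).fit(view[idx], [pseudo[i] for i in idx])
    proba = clf.predict_proba(view)
    conf = proba.max(axis=1)
    candidates = [i for i in np.argsort(-conf) if i not in pseudo][:k]
    return clf, {int(i): int(proba[i].argmax()) for i in candidates}

for _ in range(5):  # a few co-training rounds
    clf0, new0 = fit_and_label(view0, pseudo)  # view phi0 labels data for phi1
    pseudo.update(new0)
    clf1, new1 = fit_and_label(view1, pseudo)  # view phi1 labels data for phi0
    pseudo.update(new1)

print("view-1 model accuracy:", clf1.score(view1, y_true))
```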
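
The experiment-setup row gives the optimizer settings for both views (label model over ϕ0: Adam, lr 1e-4, weight decay 5e-3, batch size 64, 40 epochs; DeBERTa-large over ϕ1: last layer and pooler only, Adam, lr 1e-5, weight decay 0.01, batch size 16, 20 epochs). A hedged PyTorch/Transformers sketch of the DeBERTa-side configuration follows; the parameter-name patterns used to unfreeze the last layer and pooler are assumptions about the Hugging Face `microsoft/deberta-large` checkpoint layout, not taken from the released code.

```python
# Hedged sketch of the DeBERTa-side training configuration reported above:
# freeze everything except the last encoder layer and the pooler/classifier
# head, then train with Adam (lr 1e-5, weight decay 0.01). The name patterns
# "encoder.layer.23", "pooler", and "classifier" are assumptions about the
# Hugging Face microsoft/deberta-large checkpoint, not the authors' code.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "microsoft/deberta-large"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Freeze all parameters, then re-enable only the last layer and the head.
for name, param in model.named_parameters():
    param.requires_grad = any(
        key in name for key in ("encoder.layer.23", "pooler", "classifier")
    )

optimizer = torch.optim.Adam(
    [p for p in model.parameters() if p.requires_grad],
    lr=1e-5,
    weight_decay=0.01,
)

# Training loop skeleton (batch size 16, 20 epochs); `train_loader` is a
# hypothetical DataLoader yielding tokenized batches with a "labels" key.
# for epoch in range(20):
#     for batch in train_loader:
#         optimizer.zero_grad()
#         loss = model(**batch).loss
#         loss.backward()
#         optimizer.step()
```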