Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Co-training Improves Prompt-based Learning for Large Language Models
Authors: Hunter Lang, Monica N Agrawal, Yoon Kim, David Sontag
ICML 2022 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate that co-training (Blum & Mitchell, 1998) can improve the performance of prompt-based learning by using unlabeled data. We find that models trained in this manner can significantly improve performance on challenging datasets where there is currently a large gap between prompt-based learning and fully-supervised models. |
| Researcher Affiliation | Academia | Hunter Lang¹, Monica Agrawal¹, Yoon Kim¹, David Sontag¹. ¹MIT CSAIL. Correspondence to: <EMAIL>. |
| Pseudocode | Yes | The skeleton of our approach is shown in Algorithm 1 (full detail is provided in Algorithms 4 and 5 in the supplement). |
| Open Source Code | Yes | Our code is publicly available at https://github.com/clinicalml/cotrain-prompting |
| Open Datasets | Yes | We use the RTE (Dagan et al., 2005), CB (De Marneffe et al., 2019), TREC (Voorhees & Tice, 2000), and BoolQ (Clark et al., 2019) datasets. Full details for these datasets are in Appendix B. In the partial access setting, we do not evaluate on BoolQ due to the large amount of GPT-3 quota required for labeling. In the full access setting, we do not evaluate on TREC as T0 was pretrained on TREC. |
| Dataset Splits | Yes | This validation set was sampled uniformly from the training set to give a training/validation split of 90%/10%. |
| Hardware Specification | Yes | All models were trained on two NVIDIA A100 80Gb GPUs using PyTorch and the Transformers library (Wolf et al., 2020). |
| Software Dependencies | No | The paper mentions 'PyTorch and the Transformers library (Wolf et al., 2020)' and 'DeBERTa-large (microsoft/deberta-large)' but does not provide specific version numbers for these software components. |
| Experiment Setup | Yes | In each co-training iteration, we train the label model over view ϕ0 using Adam with learning rate 1e-4, weight decay 5e-3, and batch size 64 for 40 epochs. We fine-tune the last layer and pooler of DeBERTa-large over ϕ1 for 20 epochs using Adam with learning rate 1e-5, weight decay 0.01, batch size 16. |
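
The settings quoted in the Dataset Splits and Experiment Setup rows can be written down as a small configuration sketch. This is not the authors' code: the names `TrainConfig`, `LABEL_MODEL`, `DEBERTA_HEAD`, and `train_val_split` are hypothetical, and the random seed is an assumption, since the quoted text does not state one. Only the numeric values come from the table above.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Optimizer settings for one model in a co-training iteration."""
    optimizer: str
    learning_rate: float
    weight_decay: float
    batch_size: int
    epochs: int


# Values as quoted in the Experiment Setup row; field names are hypothetical.
LABEL_MODEL = TrainConfig("adam", 1e-4, 5e-3, 64, 40)   # label model, view phi_0
DEBERTA_HEAD = TrainConfig("adam", 1e-5, 0.01, 16, 20)  # DeBERTa-large, view phi_1


def train_val_split(examples, val_fraction=0.1, seed=0):
    """Uniformly sample a validation subset from the training examples.

    The 90%/10% ratio is from the Dataset Splits row; the seed is assumed.
    """
    rng = random.Random(seed)
    indices = list(range(len(examples)))
    rng.shuffle(indices)
    val_idx = set(indices[: round(len(examples) * val_fraction)])
    train = [ex for i, ex in enumerate(examples) if i not in val_idx]
    val = [ex for i, ex in enumerate(examples) if i in val_idx]
    return train, val
```

Calling `train_val_split(list(range(100)))` gives 90 training and 10 validation examples with no overlap. Because the table reports no software versions, a sketch like this does not pin down an exact reproduction environment.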