Don’t Stop Pretraining? Make Prompt-based Fine-tuning Powerful Learner

Authors: Zhengxiang Shi, Aldo Lipani

Venue: NeurIPS 2023

Reproducibility assessment. Each variable below is listed with its assessed result, followed by the supporting LLM response.

Research Type: Experimental
  "Our empirical evaluations on 21 benchmarks demonstrate that the PCP consistently improves the performance of state-of-the-art prompt-based FT approaches (up to 20.1% absolute) in both semi-supervised and fully-supervised settings, even with only hundreds of unlabelled examples."

Researcher Affiliation: Academia
  "Zhengxiang Shi, University College London, London, United Kingdom, zhengxiang.shi.19@ucl.ac.uk; Aldo Lipani, University College London, London, United Kingdom, aldo.lipani@ucl.ac.uk"

Pseudocode: No
  The paper describes the proposed method in textual steps and provides diagrams, but it does not include any explicitly labeled pseudocode or algorithm blocks.

Open Source Code: Yes
  "Code is available at https://github.com/ZhengxiangShi/PowerfulPromptFT."

Open Datasets: Yes
  "Following previous studies [28, 36, 100] on prompt-based FT, we derive 8 single-sentence tasks and 8 sentence-pair English tasks from the GLUE benchmark [87], SNLI [13], and 6 other widely used sentence classification tasks (i.e., SST-5, MR, CR, MPQA, Subj, TREC). Additionally, we use 5 popular benchmarks for semi-supervised learning from previous research [34, 21, 94, 48, 29, 77], including IMDB [55], AG NEWS [101], YELP REVIEW, YAHOO! ANSWER [18], and AMAZON REVIEW [57]."

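All of these corpora are publicly available. As a quick illustration (not the authors' loading code), several of them can be pulled from the Hugging Face Hub with the `datasets` library; the Hub identifiers below are the standard ones and are an assumption on our part, since the paper does not describe its data-loading pipeline.

```python
# Hedged sketch: fetching a few of the cited datasets from the Hugging Face Hub.
from datasets import load_dataset

sst2 = load_dataset("glue", "sst2")   # single-sentence GLUE task
snli = load_dataset("snli")           # sentence-pair NLI benchmark
imdb = load_dataset("imdb")           # semi-supervised learning benchmark
ag_news = load_dataset("ag_news")     # semi-supervised learning benchmark
```
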
Dataset Splits: Yes
  "Consistent with prior research [28], our validation set comprises 16 examples per class from the aforementioned datasets. Additionally, we use 16 examples per class for the training set and the entire training set as the unlabeled set in the semi-supervised setting. We also utilise the full training set for training purposes in the fully supervised setting."

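For concreteness, a 16-shot split of the kind described in that excerpt can be drawn as follows; `sample_k_per_class` is a hypothetical sketch, not the authors' sampling code.

```python
import random
from collections import defaultdict

def sample_k_per_class(dataset, k=16, seed=42):
    """Draw k examples per class (hypothetical sketch of the 16-shot splits).

    `dataset` is an iterable of dicts with a "label" field; in the
    semi-supervised setting the full training set doubles as the
    unlabelled pool.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for example in dataset:
        by_class[example["label"]].append(example)
    sampled = []
    for examples in by_class.values():
        sampled.extend(rng.sample(examples, k))
    return sampled
```
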
Hardware Specification: Yes
  "In our experiments, performing the PCP on 1k unlabelled example takes less than 10 minutes using two 24GB NVIDIA 3090 GPUs."

Software Dependencies: No
  The paper mentions using "Pytorch" and "Huggingface" but does not specify their version numbers or the versions of other libraries needed for reproducibility.

Experiment Setup: Yes
  "See hyperparameter and implementation details in Appendix E. In each trial, we train the model for 1,000 steps, evaluate performance every 100 steps, and select the best checkpoint based on optimal performance on the evaluation set. The best performance is determined by the relevant evaluation metric. For continued pre-training, we utilise the same set of hyperparameters for both TAPT and PCP, as shown in Table 11."
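
The checkpoint-selection loop described in that excerpt amounts to the sketch below; `train_step`, `evaluate`, and `save_checkpoint` are hypothetical stand-ins, since the actual implementation lives in the released repository.

```python
import random

def train_step():
    """Hypothetical stand-in for one optimisation step."""

def evaluate():
    """Hypothetical stand-in returning the evaluation-set metric."""
    return random.random()

def save_checkpoint(step):
    """Hypothetical stand-in that writes model weights to disk."""

best_metric, best_step = float("-inf"), None
for step in range(1, 1001):          # train for 1,000 steps
    train_step()
    if step % 100 == 0:              # evaluate every 100 steps
        metric = evaluate()
        if metric > best_metric:     # keep the best-performing checkpoint
            best_metric, best_step = metric, step
            save_checkpoint(step)
```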