PromptBoosting: Black-Box Text Classification with Ten Forward Passes

Authors: Bairu Hou, Joe O’Connor, Jacob Andreas, Shiyu Chang, Yang Zhang

ICML 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Experiments show that PROMPTBOOSTING achieves state-of-the-art performance in multiple black-box few-shot classification tasks, and matches or outperforms full fine-tuning in both few-shot and standard learning paradigms, while training 10x faster than existing black-box methods." "We evaluated our method on a number of downstream tasks. The results show that PROMPTBOOSTING achieves state-of-the-art performance and matches or even outperforms full fine-tuning in both few-shot and standard learning paradigms." (Section 4, Experiments)
Researcher Affiliation | Collaboration | "Bairu Hou (1), Joe O'Connor (2), Jacob Andreas (3), Shiyu Chang (1), Yang Zhang (4). (1) UC Santa Barbara, (2) UC Los Angeles, (3) MIT CSAIL, (4) MIT-IBM Watson AI Lab."
Pseudocode | Yes | "Algorithm 1: Model Ensemble in PROMPTBOOSTING" (a generic boosting sketch is given after this table)
Open Source Code | Yes | "Codes are available at https://github.com/UCSB-NLP-Chang/PromptBoosting."
Open Datasets | Yes | "Previous approaches for black-box prompt-based learning (Sun et al., 2022b;a; Deng et al., 2022; Zhang et al., 2022) are often evaluated on the following tasks: single-sentence classification (including SST-2 (Socher et al., 2013), MR (Pang & Lee, 2005), TREC (Voorhees & Tice, 2000), and AG's News (Zhang et al., 2015)) and sentence-pair classification (including SNLI (Bowman et al., 2015), MNLI-m (Williams et al., 2018), QNLI (Rajpurkar et al., 2016), and RTE (Dagan et al., 2005)). We follow the same setting and report results on the datasets above. The dataset statistics can be found in Table 4 in Appendix A. For a more comprehensive understanding of our method, we incorporate additional datasets, including SST-5 (Socher et al., 2013), CR (Hu & Liu, 2004), Subj (Pang & Lee, 2004), MPQA (Wiebe et al., 2005), and MRPC (Dolan & Brockett, 2005), in Table 9 in Appendix B; the conclusion is the same." (A dataset-loading sketch follows the table.)
Dataset Splits | Yes | "We randomly sample k examples per class from the original training set to construct a k-shot training set D_tr for model training. Following previous work (Gao et al., 2021; Zhang et al., 2021; Sun et al., 2022b), we also construct the validation set D_val by randomly sampling another k examples per class from the original training set (i.e., |D_tr| = |D_val|). By default we set k = 16 for our main experiments." (See the k-shot sampling sketch after the table.)
Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU models, or memory specifications used for running the experiments. It only mentions using the RoBERTa-large model as the backbone.
Software Dependencies | No | The paper mentions using the Huggingface transformers library but does not provide specific version numbers for it or any other software dependencies such as Python, PyTorch, or TensorFlow.
Experiment Setup | Yes | "The learning rate is set to 1e-5. We use the AdamW optimizer, and the learning rate decays linearly to 0. The training batch size is set to 16 and the total number of training epochs is 100. For our method, we sequentially train 200 weak classifiers on each task and add them to our ensemble; we stop when validation performance plateaus or when we reach the maximum number of weak classifiers." (See the optimizer/scheduler sketch after the table.)
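
The Pseudocode row points to Algorithm 1 (model ensemble). As a rough illustration of the AdaBoost-style ensembling of prompt-based weak learners that the paper describes, here is a minimal SAMME-style boosting loop in Python (NumPy only). The pre-computed prediction pool, the function names, and the selection-by-lowest-weighted-error step are simplifications for readability, not the authors' exact Algorithm 1.

```python
import numpy as np

def boost_ensemble(candidate_preds, labels, num_classes, max_learners=200):
    """SAMME-style boosting over a pool of pre-computed weak-learner
    predictions; candidate_preds has shape (num_candidates, num_examples).
    Generic sketch only, not the paper's exact Algorithm 1."""
    n = len(labels)
    w = np.full(n, 1.0 / n)                                  # example weights
    ensemble = []                                            # (candidate index, alpha) pairs
    for _ in range(max_learners):
        errs = (candidate_preds != labels).astype(float) @ w # weighted error per candidate
        best = int(np.argmin(errs))
        err = float(np.clip(errs[best], 1e-10, 1 - 1e-10))
        alpha = np.log((1 - err) / err) + np.log(num_classes - 1)
        miss = candidate_preds[best] != labels
        w = w * np.exp(alpha * miss)                         # up-weight misclassified examples
        w = w / w.sum()
        ensemble.append((best, alpha))
    return ensemble

def ensemble_predict(ensemble, candidate_preds, num_classes):
    """Weighted vote of the selected weak learners."""
    n = candidate_preds.shape[1]
    votes = np.zeros((n, num_classes))
    for idx, alpha in ensemble:
        votes[np.arange(n), candidate_preds[idx]] += alpha
    return votes.argmax(axis=1)
```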
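
For the Open Datasets row, the sketch below shows one way to pull several of the cited benchmarks with the HuggingFace datasets library. The Hub identifiers are assumptions and may not match the exact data preparation used in the paper.

```python
# Hedged example: Hub identifiers below are assumptions and may not
# match the paper's exact preprocessing or splits.
from datasets import load_dataset

sst2 = load_dataset("glue", "sst2")    # SST-2 (Socher et al., 2013)
agnews = load_dataset("ag_news")       # AG's News (Zhang et al., 2015)
snli = load_dataset("snli")            # SNLI (Bowman et al., 2015)
qnli = load_dataset("glue", "qnli")    # QNLI (Rajpurkar et al., 2016)
rte = load_dataset("glue", "rte")      # RTE (Dagan et al., 2005)

print(sst2["train"][0])                # e.g. {'sentence': ..., 'label': ..., 'idx': ...}
```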
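
The Dataset Splits row describes drawing k = 16 examples per class for D_tr and another disjoint k per class for D_val. A minimal sketch of that sampling, with a hypothetical helper name and a (text, label) input format of our choosing:

```python
import random
from collections import defaultdict

def sample_k_shot(examples, k=16, seed=42):
    """Draw k examples per class for the training set and another
    disjoint k per class for the validation set, as described above.
    `examples` is a list of (text, label) pairs; helper name is ours."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for text, label in examples:
        by_label[label].append((text, label))
    train, val = [], []
    for label, items in by_label.items():
        rng.shuffle(items)
        train.extend(items[:k])        # D_tr: k per class
        val.extend(items[k:2 * k])     # D_val: another disjoint k per class
    return train, val
```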
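
The Experiment Setup row quotes the optimization hyperparameters (learning rate 1e-5, AdamW, linear decay to 0, batch size 16, 100 epochs). Below is a hedged PyTorch/transformers sketch of that schedule; `model` and `steps_per_epoch` are placeholders, and the function name is ours.

```python
# Hedged sketch of the quoted optimization setup; `model` and
# `steps_per_epoch` are placeholders for whatever is being fine-tuned.
import torch
from transformers import get_linear_schedule_with_warmup

def make_optimizer(model, steps_per_epoch, epochs=100, lr=1e-5):
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    total_steps = steps_per_epoch * epochs
    # learning rate decays linearly to 0 over training; no warmup assumed
    scheduler = get_linear_schedule_with_warmup(
        optimizer, num_warmup_steps=0, num_training_steps=total_steps
    )
    return optimizer, scheduler
```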