PromptBoosting: Black-Box Text Classification with Ten Forward Passes
Authors: Bairu Hou, Joe O’Connor, Jacob Andreas, Shiyu Chang, Yang Zhang
ICML 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments show that PROMPTBOOSTING achieves state-of-the-art performance in multiple black-box few-shot classification tasks, and matches or outperforms full fine-tuning in both few-shot and standard learning paradigms, while training 10x faster than existing black-box methods. We evaluated our method on a number of downstream tasks (Section 4, Experiments). |
| Researcher Affiliation | Collaboration | Bairu Hou (1), Joe O'Connor (2), Jacob Andreas (3), Shiyu Chang (1), Yang Zhang (4); 1: UC Santa Barbara, 2: UC Los Angeles, 3: MIT CSAIL, 4: MIT-IBM Watson AI Lab. |
| Pseudocode | Yes | Algorithm 1 Model Ensemble in PROMPTBOOSTING (a hedged ensembling sketch appears below the table) |
| Open Source Code | Yes | Codes are available at https://github.com/UCSB-NLP-Chang/PromptBoosting. |
| Open Datasets | Yes | Previous approaches for black-box prompt-based learning (Sun et al., 2022b;a; Deng et al., 2022; Zhang et al., 2022) are often evaluated on the following tasks: single sentence classification (including SST-2 (Socher et al., 2013), MR (Pang & Lee, 2005), TREC (Voorhees & Tice, 2000) and AG's News (Zhang et al., 2015)) and sentence-pair classification (including SNLI (Bowman et al., 2015), MNLI-m (Williams et al., 2018), QNLI (Rajpurkar et al., 2016), and RTE (Dagan et al., 2005)). We follow the same setting and report results on the datasets above. The dataset statistics can be found in Table 4 in Appendix A. For a more comprehensive understanding of our method, we incorporate additional datasets including SST-5 (Socher et al., 2013), CR (Hu & Liu, 2004), Subj (Pang & Lee, 2004), MPQA (Wiebe et al., 2005), and MRPC (Dolan & Brockett, 2005) in Table 9 in Appendix B; the conclusion is the same. |
| Dataset Splits | Yes | We randomly sample k examples per class from the original training set to construct a k-shot training set Dtr for model training. Following previous work (Gao et al., 2021; Zhang et al., 2021; Sun et al., 2022b), we also construct the validation set Dval by randomly sampling another k examples per class from the original training set (i.e., |Dtr| = |Dval|). By default we set k = 16 for our main experiments. (See the split-construction sketch below the table.) |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU models, or memory specifications used for running the experiments. It only mentions using the RoBERTa-large model as the backbone. |
| Software Dependencies | No | The paper mentions using the Huggingface transformers library but does not provide specific version numbers for it or any other software dependencies like Python, PyTorch, or TensorFlow. |
| Experiment Setup | Yes | The learning rate is set to 1e-5. We use the AdamW optimizer, and the learning rate decays linearly to 0. The training batch size is set to 16 and the total number of training epochs is 100. For our method, we sequentially train 200 weak classifiers on each task and add them to our ensemble; we stop when validation performance plateaus or when we reach the maximum number of weak classifiers. (A minimal configuration sketch appears below the table.) |
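
The "Pseudocode" row above cites Algorithm 1 (model ensemble). As a rough illustration only, the following is a minimal AdaBoost-style (SAMME) ensembling sketch over pre-computed predictions of candidate weak prompt-based classifiers; the function name `boost_ensemble` and the arrays `weak_preds` and `labels` are assumptions for this sketch, not names from the paper or its released code.

```python
import numpy as np

def boost_ensemble(weak_preds, labels, num_classes, max_learners=200):
    """Multi-class AdaBoost (SAMME) over pre-computed weak-learner predictions.

    weak_preds: (num_learners, num_examples) int array, one row of predicted
                labels per candidate weak (prompt-based) classifier.
    labels:     (num_examples,) int array of gold labels.
    Returns the indices of the selected learners and their ensemble weights.
    """
    n = labels.shape[0]
    sample_w = np.full(n, 1.0 / n)            # per-example weights, updated each round
    chosen, alphas = [], []
    for _ in range(max_learners):
        # pick the weak learner with the lowest weighted error under current weights
        errors = np.array([(sample_w * (p != labels)).sum() for p in weak_preds])
        t = int(errors.argmin())
        err = max(float(errors[t]), 1e-10)
        if err >= 1.0 - 1.0 / num_classes:    # no better than chance: stop early
            break
        # SAMME weight for a multi-class weak learner
        alpha = np.log((1.0 - err) / err) + np.log(num_classes - 1)
        chosen.append(t)
        alphas.append(alpha)
        # up-weight misclassified examples and renormalize
        sample_w *= np.exp(alpha * (weak_preds[t] != labels))
        sample_w /= sample_w.sum()
    return chosen, np.array(alphas)
```

At inference time, the ensemble label for an example is the class receiving the largest alpha-weighted vote over the selected weak learners.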
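The "Dataset Splits" row describes sampling k = 16 examples per class for the training set and another disjoint k per class for the validation set. A minimal sketch of that construction, assuming examples are dicts with hypothetical `"text"`/`"label"` fields:

```python
import random
from collections import defaultdict

def build_k_shot_splits(train_set, k=16, seed=0):
    """Sample k examples per class for D_tr and a disjoint k per class for D_val."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for example in train_set:                 # example: {"text": ..., "label": ...}
        by_class[example["label"]].append(example)
    d_tr, d_val = [], []
    for label, examples in by_class.items():
        assert len(examples) >= 2 * k, f"class {label} needs at least 2k examples"
        rng.shuffle(examples)
        d_tr.extend(examples[:k])             # k-shot training set
        d_val.extend(examples[k:2 * k])       # disjoint k-shot validation set
    return d_tr, d_val
```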
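The "Experiment Setup" row reports AdamW with a learning rate of 1e-5 decaying linearly to 0, batch size 16, and 100 epochs. A minimal PyTorch configuration sketch matching those numbers is shown below; `model`, `train_dataset`, and the helper name are placeholders, not the authors' code.

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR
from torch.utils.data import DataLoader

def make_training_setup(model, train_dataset, epochs=100, lr=1e-5, batch_size=16):
    """AdamW with a learning rate that decays linearly to 0 over all training steps."""
    train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
    optimizer = AdamW(model.parameters(), lr=lr)
    total_steps = len(train_loader) * epochs
    # linear decay from lr to 0 with no warmup, as reported in the setup
    scheduler = LambdaLR(optimizer, lr_lambda=lambda step: max(0.0, 1.0 - step / total_steps))
    return train_loader, optimizer, scheduler
```

In the training loop, `scheduler.step()` is called after each `optimizer.step()` so that the learning rate reaches 0 at the final step.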