Black-Box Tuning for Language-Model-as-a-Service

Authors: Tianxiang Sun, Yunfan Shao, Hong Qian, Xuanjing Huang, Xipeng Qiu

ICML 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | The experimental results show that black-box tuning with RoBERTa on a few labeled samples not only significantly outperforms manual prompt and GPT-3's in-context learning, but also surpasses the gradient-based counterparts, i.e., prompt tuning and full model tuning.
Researcher Affiliation | Academia | Fudan University, East China Normal University, Peng Cheng Laboratory.
Pseudocode | No | The paper describes the approach and optimization process but does not include a formal pseudocode block or an algorithm section. (A hedged sketch of the optimization loop is given after this table.)
Open Source Code | Yes | Our code is publicly available at https://github.com/txsun1997/Black-Box-Tuning
Open Datasets | Yes | Dataset. We conduct experiments on several common language understanding tasks, including sentiment analysis, topic classification, natural language inference (NLI), and paraphrase. For sentiment analysis, we choose SST-2 (Socher et al., 2013) and Yelp polarity (Zhang et al., 2015a). For topic classification, we choose AG's News and DBPedia (Zhang et al., 2015a). For NLI, we choose SNLI (Bowman et al., 2015) and RTE (Wang et al., 2019). For paraphrase, we choose MRPC (Dolan & Brockett, 2005). (An illustrative loading sketch follows the table.)
Dataset Splits | Yes | We randomly select k samples for each class to construct a k-shot training set D_train, and compose a development set D_dev by randomly drawing another k samples from the original training set, ensuring that |D_train| = |D_dev| to simulate the true few-shot learning setting (Perez et al., 2021). (A sketch of this split construction follows the table.)
Hardware Specification | Yes | All the methods are implemented with PyTorch (Paszke et al., 2019) and experimented on a single NVIDIA GTX 3090 GPU.
Software Dependencies | No | The paper mentions "PyTorch (Paszke et al., 2019)" and "ONNX Runtime" but does not specify version numbers for these software components. It does not list any other software dependencies with version numbers.
Experiment Setup | Yes | For black-box tuning, Table 2 gives the default hyper-parameter configuration used in the experiments; the effect of each hyper-parameter is explored in Section 4.3. Table 2 defaults: prompt length (L) = 50; subspace dimension (d) = 500; population size (λ) = 20; random projection (A) = uniform; loss function (L) = cross entropy; budget = 8,000 API calls. Additionally, for prompt tuning: "Adam optimizer (Kingma & Ba, 2015) with learning rate of 5e-4 and batch size of 16 for 1000 epochs." (A minimal sketch combining these defaults appears after the table.)
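
As an illustration for the Open Datasets row, all seven benchmarks are publicly distributed. The Hugging Face `datasets` hub identifiers below are assumptions, not something the paper prescribes; any loader that yields the original training sets would do.

```python
# Hedged sketch: fetch the seven benchmarks via Hugging Face `datasets`.
# The hub identifiers are assumptions; the paper does not specify a loader.
from datasets import load_dataset

sst2    = load_dataset("glue", "sst2")        # sentiment analysis
yelp    = load_dataset("yelp_polarity")       # sentiment analysis
ag_news = load_dataset("ag_news")             # topic classification
dbpedia = load_dataset("dbpedia_14")          # topic classification
snli    = load_dataset("snli")                # natural language inference
rte     = load_dataset("super_glue", "rte")   # natural language inference
mrpc    = load_dataset("glue", "mrpc")        # paraphrase
```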
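For the Dataset Splits row, the following is a minimal sketch of the true few-shot construction quoted above: k samples per class for D_train and another disjoint k per class for D_dev, both drawn from the original training set so that |D_train| = |D_dev|. The function name and signature are illustrative, not taken from the paper's code.

```python
# Hedged sketch of the k-shot split described in the paper: k examples per
# class for D_train, another disjoint k per class for D_dev.
import random
from collections import defaultdict

def k_shot_split(examples, k, seed=0):
    """examples: iterable of (text, label) pairs; returns (train, dev)."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for text, label in examples:
        by_label[label].append((text, label))
    train, dev = [], []
    for label, pool in by_label.items():
        rng.shuffle(pool)
        train.extend(pool[:k])       # k samples per class -> D_train
        dev.extend(pool[k:2 * k])    # another disjoint k per class -> D_dev
    return train, dev
```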
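Finally, tying the Pseudocode and Experiment Setup rows together: since the paper has no formal algorithm block, here is a hedged reconstruction of the derivative-free loop under the default configuration (L = 50, d = 500, λ = 20, uniform projection, 8,000 API calls). The `pycma` dependency and the `query_loss` placeholder for the model-inference API are assumptions; the embedding dimension D = 1024 corresponds to RoBERTa-large.

```python
# Hedged reconstruction of the black-box tuning loop: CMA-ES searches a
# d-dimensional subspace, and a fixed random projection A maps each point
# to a continuous prompt that is evaluated through the black-box service.
import cma
import numpy as np

L, D = 50, 1024        # prompt length; embedding dim (RoBERTa-large, assumed)
d = 500                # subspace dimension
budget = 8000          # total number of API calls

rng = np.random.default_rng(0)
# Fixed random projection A: maps z in R^d to a prompt in R^(L*D);
# sampled once from a uniform distribution, per the default configuration.
A = rng.uniform(-1.0, 1.0, size=(L * D, d))

def query_loss(prompt_embedding: np.ndarray) -> float:
    """Placeholder (assumed) for the inference API: prepend the continuous
    prompt, run the frozen PTM on D_train, return the cross-entropy loss."""
    raise NotImplementedError

# CMA-ES over the subspace with population size λ = 20, capped at the budget.
es = cma.CMAEvolutionStrategy(d * [0.0], 1.0,
                              {"popsize": 20, "maxfevals": budget})
while not es.stop():
    zs = es.ask()  # λ candidate points z in R^d
    losses = [query_loss((A @ np.asarray(z)).reshape(L, D)) for z in zs]
    es.tell(zs, losses)  # derivative-free CMA-ES update

z_best = es.result.xbest  # best subspace point found within the budget
```

Note that all optimizer state lives on the client side; the service only ever runs forward passes, which is what makes the setting black-box.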