Black-Box Tuning for Language-Model-as-a-Service

Authors: Tianxiang Sun, Yunfan Shao, Hong Qian, Xuanjing Huang, Xipeng Qiu

ICML 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | The experimental results show that black-box tuning with RoBERTa on a few labeled samples not only significantly outperforms manual prompt and GPT-3's in-context learning, but also surpasses the gradient-based counterparts, i.e., prompt tuning and full model tuning.
Researcher Affiliation | Academia | Fudan University, East China Normal University, Peng Cheng Laboratory.
Pseudocode | No | The paper describes the approach and optimization process but does not include a formal pseudocode block or an algorithm section. (A hedged sketch of the optimization loop is given after this table.)
Open Source Code | Yes | Our code is publicly available at https://github.com/txsun1997/Black-Box-Tuning
Open Datasets | Yes | Dataset. We conduct experiments on several common language understanding tasks, including sentiment analysis, topic classification, natural language inference (NLI), and paraphrase. For sentiment analysis, we choose SST-2 (Socher et al., 2013) and Yelp polarity (Zhang et al., 2015a). For topic classification, we choose AG's News and DBPedia (Zhang et al., 2015a). For NLI, we choose SNLI (Bowman et al., 2015) and RTE (Wang et al., 2019). For paraphrase, we choose MRPC (Dolan & Brockett, 2005). (An illustrative loading sketch follows the table.)
Dataset Splits | Yes | We randomly select k samples for each class to construct a k-shot training set D_train, and compose a development set D_dev by randomly drawing another k samples from the original training set, ensuring that |D_train| = |D_dev| to simulate the true few-shot learning setting (Perez et al., 2021). (A sketch of this split construction follows the table.)
Hardware Specification | Yes | All the methods are implemented with PyTorch (Paszke et al., 2019) and experimented on a single NVIDIA GTX 3090 GPU.
Software Dependencies | No | The paper mentions "PyTorch (Paszke et al., 2019)" and "ONNX Runtime" but does not specify version numbers for these software components. It does not list any other software dependencies with version numbers.
Experiment Setup | Yes | For black-box tuning, Table 2 gives the default hyper-parameter configuration used in the experiments; the effect of each hyper-parameter is explored in Section 4.3. Table 2 defaults: prompt length (L) = 50; subspace dimension (d) = 500; population size (λ) = 20; random projection (A) = uniform; loss function (L) = cross entropy; budget = 8,000 API calls. Additionally, for prompt tuning: "Adam optimizer (Kingma & Ba, 2015) with learning rate of 5e-4 and batch size of 16 for 1000 epochs." (A minimal sketch combining these defaults appears after the table.)
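
As an illustration for the Open Datasets row, all seven benchmarks are publicly distributed. The Hugging Face `datasets` hub identifiers below are assumptions, not something the paper prescribes; any loader that yields the original training sets would do.

```python
# Hedged sketch: fetch the seven benchmarks via Hugging Face `datasets`.
# The hub identifiers are assumptions; the paper does not specify a loader.
from datasets import load_dataset

sst2    = load_dataset("glue", "sst2")        # sentiment analysis
yelp    = load_dataset("yelp_polarity")       # sentiment analysis
ag_news = load_dataset("ag_news")             # topic classification
dbpedia = load_dataset("dbpedia_14")          # topic classification
snli    = load_dataset("snli")                # natural language inference
rte     = load_dataset("super_glue", "rte")   # natural language inference
mrpc    = load_dataset("glue", "mrpc")        # paraphrase
```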
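For the Dataset Splits row, the following is a minimal sketch of the true few-shot construction quoted above: k samples per class for D_train and another disjoint k per class for D_dev, both drawn from the original training set so that |D_train| = |D_dev|. The function name and signature are illustrative, not taken from the paper's code.

```python
# Hedged sketch of the k-shot split described in the paper: k examples per
# class for D_train, another disjoint k per class for D_dev.
import random
from collections import defaultdict

def k_shot_split(examples, k, seed=0):
    """examples: iterable of (text, label) pairs; returns (train, dev)."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for text, label in examples:
        by_label[label].append((text, label))
    train, dev = [], []
    for label, pool in by_label.items():
        rng.shuffle(pool)
        train.extend(pool[:k])       # k samples per class -> D_train
        dev.extend(pool[k:2 * k])    # another disjoint k per class -> D_dev
    return train, dev
```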
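Finally, tying the Pseudocode and Experiment Setup rows together: since the paper has no formal algorithm block, here is a hedged reconstruction of the derivative-free loop under the default configuration (L = 50, d = 500, λ = 20, uniform projection, 8,000 API calls). The `pycma` dependency and the `query_loss` placeholder for the model-inference API are assumptions; the embedding dimension D = 1024 corresponds to RoBERTa-large.

```python
# Hedged reconstruction of the black-box tuning loop: CMA-ES searches a
# d-dimensional subspace, and a fixed random projection A maps each point
# to a continuous prompt that is evaluated through the black-box service.
import cma
import numpy as np

L, D = 50, 1024        # prompt length; embedding dim (RoBERTa-large, assumed)
d = 500                # subspace dimension
budget = 8000          # total number of API calls

rng = np.random.default_rng(0)
# Fixed random projection A: maps z in R^d to a prompt in R^(L*D);
# sampled once from a uniform distribution, per the default configuration.
A = rng.uniform(-1.0, 1.0, size=(L * D, d))

def query_loss(prompt_embedding: np.ndarray) -> float:
    """Placeholder (assumed) for the inference API: prepend the continuous
    prompt, run the frozen PTM on D_train, return the cross-entropy loss."""
    raise NotImplementedError

# CMA-ES over the subspace with population size λ = 20, capped at the budget.
es = cma.CMAEvolutionStrategy(d * [0.0], 1.0,
                              {"popsize": 20, "maxfevals": budget})
while not es.stop():
    zs = es.ask()  # λ candidate points z in R^d
    losses = [query_loss((A @ np.asarray(z)).reshape(L, D)) for z in zs]
    es.tell(zs, losses)  # derivative-free CMA-ES update

z_best = es.result.xbest  # best subspace point found within the budget
```

Note that all optimizer state lives on the client side; the service only ever runs forward passes, which is what makes the setting black-box.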