On the Role of Attention in Prompt-tuning

Authors: Samet Oymak, Ankit Singh Rawat, Mahdi Soltanolkotabi, Christos Thrampoulidis

ICML 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Finally, we empirically validate our theoretical insights on both synthetic contextual-mixture datasets and image-classification datasets. Specifically, we compare multiple variants of prompt-tuning against standard fine-tuning on the latter. 5. Experiments: First, we verify the utility of prompt-attention via experiments on a synthetic setting that precisely follows the contextual data model from Section 2.2. Subsequently, we explore prompt-tuning on various image classification tasks that are motivated by the contextual data model and compare it with the standard fine-tuning method. Finally, we validate the utility of prompt vectors in distinguishing relevant tokens from irrelevant tokens via prompt-attention under an image classification setting.
Researcher Affiliation | Collaboration | *In alphabetical order. Affiliations: University of Michigan & UC Riverside, USA; Google Research NYC, USA; University of Southern California, USA; University of British Columbia, Canada.
Pseudocode | Yes | Algorithm: We split the train set into three separate subsets S1, S2, S3 of size n each. Starting from w0 = 0, q0 = 0, the algorithm proceeds in three gradient steps with step sizes η > 0 and γ > 0 and a final debiasing step as follows: w1 = −η ∇w LS1(0, 0), q1 = −γ ∇q LS2(0, w1), w2 = −η ∇w LS3(q1, w1), where LSj, j = 1, 2, 3 is the loss in (8) evaluated on set Sj. The debiasing step is defined in Section 4.3. (A code sketch of these three steps appears below the table.)
Open Source Code | No | The paper mentions using the "Scenic library (Dehghani et al., 2022)" (https://github.com/google-research/scenic) to conduct its experiments on image classification, but it does not provide source code for its own methodology.
Open Datasets | Yes | Dataset. Motivated by our contextual data model, we construct three datasets based on CIFAR-10 (Krizhevsky et al., 2009) to conduct our evaluation (see Fig. 3 for examples). (A dataset-construction sketch appears below the table.)
Dataset Splits | No | The paper states: "By construction, each dataset has 50,000 train and 10,000 test examples corresponding to train and test set of CIFAR-10." It does not explicitly provide details about a validation split.
Hardware Specification | No | The paper does not specify the hardware used for its experiments.
Software Dependencies | No | The paper mentions using the "Adam optimizer" and the "Scenic library (Dehghani et al., 2022)" but does not specify software dependencies with version numbers.
Experiment Setup | Yes | We employ the Adam optimizer (Kingma & Ba, 2014) with β1 = 0.9, β2 = 0.999, weight decay = 0.1, and batch size = 128 while training a randomly initialized model. Furthermore, we employ a linear warm-up of the learning rate followed by a cosine learning rate schedule with base learning rate 3e-3. As for the fine-tuning and prompt-tuning experiments that (partially) initialize from a pre-trained model, we rely on SGD with momentum parameter 0.9 and batch size = 128 to update the trainable parameters. Again, we utilize a linear warm-up of the learning rate followed by a cosine learning rate schedule. Throughout our experiments, the base learning rates for fine-tuning and prompt-tuning are 1e-3 and 0.1, respectively. (An optimizer-configuration sketch appears below the table.)
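Below is a minimal JAX sketch of the three-step gradient procedure quoted in the Pseudocode row. The stand-in loss `toy_loss` (a logistic loss on a single prompt-attention layer), the data shapes, and the default step sizes are illustrative assumptions rather than the paper's exact loss (8), and the final debiasing step of Section 4.3 is omitted.

```python
import jax
import jax.numpy as jnp

def toy_loss(q, w, X, y):
    """Stand-in for the loss in (8): logistic loss of a prompt-attention model.

    X: (n, T, d) token sequences, y: (n,) labels in {-1, +1},
    q: (d,) prompt/attention vector, w: (d,) linear prediction head.
    """
    attn = jax.nn.softmax(X @ q, axis=1)           # (n, T) attention over tokens
    feats = jnp.einsum('ntd,nt->nd', X, attn)      # attention-weighted token average
    margins = y * (feats @ w)
    return jnp.mean(jnp.log1p(jnp.exp(-margins)))  # logistic loss

def three_step(S1, S2, S3, eta=1.0, gamma=1.0, d=8):
    """Three gradient steps on disjoint splits S1, S2, S3 (debiasing step omitted)."""
    grad_q = jax.grad(toy_loss, argnums=0)
    grad_w = jax.grad(toy_loss, argnums=1)
    zeros = jnp.zeros(d)
    w1 = -eta * grad_w(zeros, zeros, *S1)    # step 1: head only, no prompt yet
    q1 = -gamma * grad_q(zeros, w1, *S2)     # step 2: prompt given the first head
    w2 = -eta * grad_w(q1, w1, *S3)          # step 3: refit head with prompt-attention
    return q1, w2

# Toy usage with random data standing in for the contextual mixture model.
def make_split(key, n=64, T=5, d=8):
    kx, ky = jax.random.split(key)
    X = jax.random.normal(kx, (n, T, d))
    y = jnp.sign(jax.random.normal(ky, (n,)))
    return X, y

k1, k2, k3 = jax.random.split(jax.random.PRNGKey(0), 3)
q1, w2 = three_step(make_split(k1), make_split(k2), make_split(k3))
```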
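The Open Datasets row states that the three evaluation datasets are built from CIFAR-10 following the contextual data model. The sketch below is only a hedged illustration of that idea: one labeled, relevant CIFAR-10 image token mixed with distractor tokens. The number of tokens and the choice of distractors are assumptions of this sketch; the paper's three datasets (its Fig. 3) differ in their details.

```python
import numpy as np

def make_contextual_example(images, labels, idx, num_tokens=4, rng=None):
    """Build one token sequence: the labeled CIFAR-10 image plus distractors.

    images: (N, 32, 32, 3) array, labels: (N,) array. Returns (tokens, label)
    with tokens of shape (num_tokens, 32, 32, 3) in random order.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    relevant = images[idx]
    # Distractors drawn from other classes (an assumption of this sketch).
    other = np.flatnonzero(labels != labels[idx])
    distractor_ids = rng.choice(other, size=num_tokens - 1, replace=False)
    tokens = np.stack([relevant, *images[distractor_ids]])
    rng.shuffle(tokens, axis=0)   # hide which position holds the relevant token
    return tokens, labels[idx]
```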
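For concreteness, the optimizer settings in the Experiment Setup row can be written down with Optax, the optimizer library commonly paired with Scenic/JAX. Only the betas, weight decay, momentum, batch size, and base learning rates come from the quoted text; the warm-up length, total step count, and the use of decoupled weight decay (`adamw`) are assumptions of this sketch.

```python
import optax

TOTAL_STEPS = 20_000   # assumption: total step count is not stated in the quoted setup
WARMUP_STEPS = 1_000   # assumption: warm-up length is not stated

def warmup_cosine(base_lr):
    # Linear warm-up to base_lr, then cosine decay, as described in the row.
    return optax.warmup_cosine_decay_schedule(
        init_value=0.0, peak_value=base_lr,
        warmup_steps=WARMUP_STEPS, decay_steps=TOTAL_STEPS)

# Training from random initialization: Adam, base lr 3e-3, weight decay 0.1.
# (Batch size 128 is handled by the data pipeline, not the optimizer.)
scratch_opt = optax.adamw(
    learning_rate=warmup_cosine(3e-3), b1=0.9, b2=0.999, weight_decay=0.1)

# Fine-tuning from a pre-trained model: SGD with momentum 0.9, base lr 1e-3.
finetune_opt = optax.sgd(learning_rate=warmup_cosine(1e-3), momentum=0.9)

# Prompt-tuning (only the prompt is trainable): SGD with momentum 0.9, base lr 0.1.
prompt_opt = optax.sgd(learning_rate=warmup_cosine(0.1), momentum=0.9)
```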