On the Role of Attention in Prompt-tuning
Authors: Samet Oymak, Ankit Singh Rawat, Mahdi Soltanolkotabi, Christos Thrampoulidis
ICML 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | "Finally, we empirically validate our theoretical insights on both synthetic contextual-mixture datasets and image-classification datasets. Specifically, we compare multiple variants of prompt-tuning against standard fine-tuning on the latter." From Section 5 (Experiments): "First, we verify the utility of prompt-attention via experiments on a synthetic setting that precisely follows the contextual data model from Section 2.2. Subsequently, we explore prompt-tuning on various image classification tasks that are motivated by the contextual data model and compare it with the standard fine-tuning method. Finally, we validate the utility of prompt vectors in distinguishing relevant tokens from irrelevant tokens via prompt-attention under an image classification setting." |
| Researcher Affiliation | Collaboration | Authors listed in alphabetical order. Affiliations: 1) University of Michigan & UC Riverside, USA; 2) Google Research NYC, USA; 3) University of Southern California, USA; 4) University of British Columbia, Canada. |
| Pseudocode | Yes | Algorithm: "We split the train set in three separate subsets S1, S2, S3 of size n each. Starting from w0 = 0, q0 = 0, the algorithm proceeds in three gradient steps for step sizes η > 0 and γ > 0 and a final debiasing step as follows: w1 = −η ∇w LS1(0, 0), q1 = −γ ∇q LS2(0, w1), w2 = −η ∇w LS3(q1, w1), where LSj, j = 1, 2, 3 is the loss in (8) evaluated on set Sj. The debiasing step is defined in Section 4.3." A code sketch of these three steps follows after the table. |
| Open Source Code | No | The paper mentions using the "Scenic library (Dehghani et al., 2022)" (https://github.com/google-research/scenic) to conduct its image-classification experiments, but does not provide source code for its own methodology. |
| Open Datasets | Yes | Dataset. Motivated by our contextual data model, we construct three datasets based on CIFAR-10 (Krizhevsky et al., 2009) to conduct our evaluation (see Fig. 3 for examples). |
| Dataset Splits | No | The paper states: “By construction, each dataset has 50,000 train and 10,000 test examples corresponding to train and test set of CIFAR-10.” It does not explicitly provide details about a validation split. |
| Hardware Specification | No | The paper does not specify the hardware used for its experiments. |
| Software Dependencies | No | The paper mentions using “Adam optimizer” and “Scenic library (Dehghani et al., 2022)” but does not specify software dependencies with version numbers. |
| Experiment Setup | Yes | We employ the Adam optimizer (Kingma & Ba, 2014) with β1 = 0.9, β2 = 0.999, weight decay = 0.1, and batch size = 128 while training a randomly initialized model. Furthermore, we employ a linear warm-up of the learning rate followed by a cosine learning rate schedule with base learning rate 3e-3. As for the fine-tuning and prompt-tuning experiments that (partially) initialize from a pre-trained model, we rely on SGD with momentum parameter 0.9 and batch size = 128 to update trainable parameters. Again, we utilize a linear warm-up of the learning rate followed by a cosine learning rate schedule. Throughout our experiments, the base learning rates for fine-tuning and prompt-tuning are 1e-3 and 0.1, respectively. A hedged optimizer-configuration sketch follows after the table. |
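
To make the three-step procedure quoted in the Pseudocode row concrete, here is a minimal JAX sketch. The loss function, its signature, and the toy data are placeholders, not the paper's: the actual loss (the paper's Eq. (8)) and the final debiasing step of Section 4.3 are not reproduced, and only the structure of the three gradient steps is illustrated.

```python
import jax
import jax.numpy as jnp

def three_step_prompt_tuning(loss, S1, S2, S3, dim, eta=0.1, gamma=0.1):
    """Three gradient steps of the quoted algorithm.

    loss(q, w, S) -> scalar is a stand-in for the paper's loss (its Eq. (8));
    S1, S2, S3 are the three disjoint train subsets of size n each; the final
    debiasing step (Section 4.3 of the paper) is not included here.
    """
    w0 = jnp.zeros(dim)
    q0 = jnp.zeros(dim)
    # Step 1: gradient step on w at (q, w) = (0, 0), using subset S1.
    w1 = -eta * jax.grad(loss, argnums=1)(q0, w0, S1)
    # Step 2: gradient step on the prompt q with w fixed at w1, using subset S2.
    q1 = -gamma * jax.grad(loss, argnums=0)(q0, w1, S2)
    # Step 3: gradient step on w evaluated at (q1, w1), using subset S3.
    w2 = -eta * jax.grad(loss, argnums=1)(q1, w1, S3)
    return q1, w2

# Toy usage with a placeholder squared loss (NOT the paper's loss):
def toy_loss(q, w, S):
    X, y = S
    return jnp.mean((X @ (q + w) - y) ** 2)

X = jax.random.normal(jax.random.PRNGKey(0), (90, 8))
y = jnp.sign(X @ jnp.ones(8))
S1, S2, S3 = [(X[i::3], y[i::3]) for i in range(3)]
q1, w2 = three_step_prompt_tuning(toy_loss, S1, S2, S3, dim=8)
```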
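
The training recipe in the Experiment Setup row can be expressed with standard optimizer utilities. The sketch below uses optax, which is an assumption: the paper relies on the Scenic library but does not name its optimizer implementation, reports no warm-up length or total step count (WARMUP_STEPS and TOTAL_STEPS are placeholders), and "Adam with weight decay" is mapped to adamw here.

```python
import optax

# Placeholders: the paper does not report warm-up length or total training
# steps, and batch size 128 lives in the data pipeline rather than here.
WARMUP_STEPS = 500
TOTAL_STEPS = 20_000

# From-scratch training: Adam (beta1 = 0.9, beta2 = 0.999, weight decay 0.1)
# with linear warm-up into a cosine schedule, base learning rate 3e-3.
# "Adam with weight decay" is mapped to optax.adamw here (an assumption).
scratch_schedule = optax.warmup_cosine_decay_schedule(
    init_value=0.0, peak_value=3e-3,
    warmup_steps=WARMUP_STEPS, decay_steps=TOTAL_STEPS)
scratch_optimizer = optax.adamw(
    learning_rate=scratch_schedule, b1=0.9, b2=0.999, weight_decay=0.1)

# Fine-tuning / prompt-tuning from a pre-trained model: SGD with momentum 0.9
# and the same warm-up + cosine schedule, base learning rates 1e-3 and 0.1.
def sgd_with_warmup_cosine(base_lr):
    schedule = optax.warmup_cosine_decay_schedule(
        init_value=0.0, peak_value=base_lr,
        warmup_steps=WARMUP_STEPS, decay_steps=TOTAL_STEPS)
    return optax.sgd(learning_rate=schedule, momentum=0.9)

finetune_optimizer = sgd_with_warmup_cosine(1e-3)
prompt_tune_optimizer = sgd_with_warmup_cosine(0.1)
```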