DePT: Decomposed Prompt Tuning for Parameter-Efficient Fine-tuning
Authors: Zhengxiang Shi, Aldo Lipani
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through extensive experiments on 23 natural language processing (NLP) and vision-language (VL) tasks, we demonstrate that DEPT outperforms state-of-the-art PEFT approaches, including the full fine-tuning baseline, in some scenarios. |
| Researcher Affiliation | Academia | Zhengxiang Shi, Aldo Lipani University College London, United Kingdom {zhengxiang.shi.19,aldo.lipani}@ucl.ac.uk |
| Pseudocode | No | The paper describes the method in text and figures but does not include structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | https://github.com/ZhengxiangShi/DePT |
| Open Datasets | Yes | We evaluate our proposed method DEPT on 21 NLP tasks and 2 vision-language tasks. For NLP tasks, we follow the previous works (Vu et al., 2022; Sung et al., 2022b; Asai et al., 2022; Wang et al., 2023b) and use various datasets sourced from: (1) GLUE (Wang et al., 2018) benchmark... (2) SuperGLUE benchmark (Wang et al., 2019)... (3) MRQA 2019 Shared Task (Fisch et al., 2019)... (4) other datasets, including WinoGrande (Sakaguchi et al., 2021), Yelp-2 (Zhang et al., 2015), SciTail (Khot et al., 2018) and PAWS-Wiki (Zhang et al., 2019). For vision-language tasks, we follow prior works (Sung et al., 2022a;b) to experiment with the visual question-answering task, VQA (Goyal et al., 2017), and the image caption generation task, MSCOCO (Chen et al., 2015). |
| Dataset Splits | Yes | For a fair comparison, we directly quote performance metrics from published papers (Mahabadi et al., 2021; Karimi Mahabadi et al., 2021; Asai et al., 2022; Wang et al., 2023b; Sung et al., 2022b), where all these baselines use T5-BASE as the backbone and adhere to the train, validation and test splits used by Karimi Mahabadi et al. (2021); Mahabadi et al. (2021) for NLP tasks and by Sung et al. (2022b) for vision-language tasks. |
| Hardware Specification | Yes | Figure 4 reports the inference speed, measured as the average number of samples evaluated per second on the GLUE benchmark using a single RTX 3090 GPU. |
| Software Dependencies | No | The paper mentions software such as PyTorch, Hugging Face Transformers, and Hugging Face PEFT but does not provide specific version numbers for these dependencies. |
| Experiment Setup | Yes | In our study, we mainly experiment using the T5-BASE model with 220M parameters (Raffel et al., 2020). We consistently set the number of virtual tokens l as 100 across all tasks for the vanilla PT and adjust the hyper-parameters of DEPT accordingly to maintain the equivalent number of trainable parameters. For instance, the vanilla PT contains l × d trainable parameters, where the hidden size d is 768 for T5-BASE, and DEPT can configure the number of virtual tokens m as 40 and the rank of the low-rank matrices r as 45, resulting in m × d + (s + d) × r trainable parameters. This yields a total of 76,800 trainable parameters, aligning with the vanilla PT. For VL tasks, we utilise the CLIP-T5 architecture, which combines CLIP (Radford et al., 2021) and T5-BASE (Raffel et al., 2020), with CLIP frozen. We follow the prior work (Sung et al., 2022b) to concatenate the visual representation from CLIP with the text embedding from T5-BASE, where a trainable visual projection layer is used between CLIP and T5 to align the visual representation to the same dimension as the text embedding. We also extend our evaluation to include T5-SMALL (60M), T5-LARGE (770M), GPT2-SMALL (110M), GPT2-MEDIUM (345M), and GPT2-LARGE (774M) models. ... We conduct a grid search for learning rates. For the soft prompt, we search the learning rate within the set {3e-1, 4e-1, 5e-1}. For the low-rank matrix pairs, we search the learning rate within the set {1e-4, 5e-4, 1e-3, 5e-3}. We choose a batch size of 16. We typically use a max sequence length of 256, except for SuperGLUE-MultiRC, where the max sequence length is 348. In each trial, we train the model for 30,000 steps, evaluate performance every 1,000 steps, and select the best checkpoint based on optimal performance on the evaluation set. For large datasets with more than 100,000 training examples, we follow the prior work (Vu et al., 2022) to train the vanilla PT and our proposed method DEPT with up to 300,000 steps. Training more steps helps improve the performance of the vanilla PT on large datasets. The best performance is determined by the relevant evaluation metric. |
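The parameter budget quoted in the Experiment Setup row can be checked with a short calculation. The sketch below assumes, per the quoted setup, hidden size d = 768, maximum sequence length s = 256, l = 100 virtual tokens for vanilla PT, and m = 40, r = 45 for DEPT; the function names are illustrative and do not come from the released code.

```python
# Minimal sketch (not from the official DePT repo): verify the trainable-parameter
# budgets quoted in the Experiment Setup row, with s the maximum sequence length.

def vanilla_pt_params(l: int, d: int) -> int:
    """Vanilla prompt tuning: l virtual tokens, each a vector of hidden size d."""
    return l * d


def dept_params(m: int, d: int, s: int, r: int) -> int:
    """DEPT: a shorter soft prompt (m tokens of size d) plus a pair of low-rank
    matrices of shapes (s, r) and (r, d) that update the frozen word embeddings."""
    return m * d + (s + d) * r


if __name__ == "__main__":
    d, s = 768, 256                                       # T5-BASE hidden size; max sequence length
    assert vanilla_pt_params(l=100, d=d) == 76_800        # vanilla PT budget
    assert dept_params(m=40, d=d, s=s, r=45) == 76_800    # DEPT matches the same budget
    print("Both configurations use 76,800 trainable parameters.")
```

Both configurations come out to exactly 76,800 trainable parameters, which is what the quoted text means by "aligning with the vanilla PT".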
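The learning-rate grid search described in the same row amounts to a small sweep over two separate learning rates. The loop below is a hedged sketch: `run_trial` is a hypothetical placeholder for the actual training-and-evaluation routine, which this report does not reproduce; only the grids, batch size, sequence lengths, and step counts come from the quoted setup.

```python
import random
from itertools import product

# Learning-rate grids quoted in the Experiment Setup row.
PROMPT_LRS = [3e-1, 4e-1, 5e-1]        # soft-prompt learning rates
MATRIX_LRS = [1e-4, 5e-4, 1e-3, 5e-3]  # low-rank matrix-pair learning rates


def run_trial(prompt_lr: float, matrix_lr: float) -> float:
    """Hypothetical stand-in for one trial: train DEPT for 30,000 steps
    (up to 300,000 for datasets with more than 100,000 training examples),
    batch size 16, max sequence length 256 (348 for SuperGLUE-MultiRC),
    evaluating every 1,000 steps, and return the best validation metric.
    Here it just returns a random score so the sweep runs end to end."""
    return random.random()


best_score, best_config = float("-inf"), None
for prompt_lr, matrix_lr in product(PROMPT_LRS, MATRIX_LRS):
    score = run_trial(prompt_lr, matrix_lr)
    if score > best_score:
        best_score, best_config = score, (prompt_lr, matrix_lr)

print(f"Best config (placeholder scores): {best_config}")
```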