Parameter-Efficient Transfer Learning for NLP
Authors: Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, Sylvain Gelly
ICML 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To demonstrate adapter's effectiveness, we transfer the recently proposed BERT Transformer model to 26 diverse text classification tasks, including the GLUE benchmark. Adapters attain near state-of-the-art performance, whilst adding only a few parameters per task. On GLUE, we attain within 0.4% of the performance of full fine-tuning, adding only 3.6% parameters per task. (A minimal adapter sketch is given after the table.) |
| Researcher Affiliation | Collaboration | ¹Google Research, ²Jagiellonian University. Correspondence to: Neil Houlsby <neilhoulsby@google.com>. |
| Pseudocode | No | The paper includes architectural diagrams but no structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide explicit statements or links indicating that the source code for the methodology is openly available. |
| Open Datasets | Yes | We first evaluate on GLUE. For these datasets, we transfer from the pre-trained BERT-Large model, which contains 24 layers, and a total of 330M parameters, see Devlin et al. (2018) for details. and To further validate that adapters yield compact, performant models, we test on additional, publicly available, text classification tasks. |
| Dataset Splits | Yes | For each dataset and algorithm, we run a hyperparameter sweep and select the best model according to accuracy on the validation set. and For GLUE, the validation set accuracy is reported. |
| Hardware Specification | Yes | All runs are trained on 4 Google Cloud TPUs with a batch size of 32. |
| Software Dependencies | No | The paper mentions using the Adam optimizer and TensorFlow Hub, but does not provide specific version numbers for any software dependencies. |
| Experiment Setup | Yes | Our training procedure also follows Devlin et al. (2018). We optimize using Adam (Kingma & Ba, 2014), whose learning rate is increased linearly over the first 10% of the steps, and then decayed linearly to zero. All runs are trained on 4 Google Cloud TPUs with a batch size of 32. For each dataset and algorithm, we run a hyperparameter sweep and select the best model according to accuracy on the validation set. We sweep learning rates in {3 × 10⁻⁵, 3 × 10⁻⁴, 3 × 10⁻³}, and number of epochs in {3, 20}. (A sketch of this schedule and sweep follows the table.) |
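
The "few parameters per task" quoted in the Research Type row come from bottleneck adapter modules inserted into each Transformer layer and trained while the pre-trained weights stay frozen. The following is a minimal sketch, assuming PyTorch, a BERT-Large hidden size of 1024, and an illustrative bottleneck of 64; the GELU nonlinearity and the near-zero initialization scale are assumptions for illustration, not values taken verbatim from the paper.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter sketch: down-project, nonlinearity, up-project,
    plus a residual connection, in the spirit of Houlsby et al. (2019)."""

    def __init__(self, d_model: int = 1024, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)
        self.up = nn.Linear(bottleneck, d_model)
        self.act = nn.GELU()  # assumed nonlinearity; the paper uses a generic nonlinearity
        # Near-identity initialization so the adapted network starts close to the
        # frozen pre-trained model (small random weights, zero biases).
        nn.init.normal_(self.down.weight, std=1e-3)
        nn.init.zeros_(self.down.bias)
        nn.init.normal_(self.up.weight, std=1e-3)
        nn.init.zeros_(self.up.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Residual connection keeps the layer close to identity at initialization.
        return x + self.up(self.act(self.down(x)))
```

Only the adapter (and task-specific head) parameters would be trained per task, which is how the per-task parameter count stays at a few percent of the full model.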
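
The training procedure quoted in the Experiment Setup row (linear warm-up over the first 10% of steps, then linear decay to zero, with a small grid of learning rates and epoch counts) can be written as a simple schedule. The sketch below is plain Python; the `steps_per_epoch` value is a placeholder, since the number of steps depends on each dataset's size and the batch size of 32.

```python
import itertools

def learning_rate(step: int, total_steps: int, peak_lr: float, warmup_frac: float = 0.1) -> float:
    """Linear warm-up over the first 10% of steps, then linear decay to zero."""
    warmup_steps = max(1, int(warmup_frac * total_steps))
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr * max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

# Sweep grid from the quoted setup: learning rates {3e-5, 3e-4, 3e-3}, epochs {3, 20}.
for peak_lr, epochs in itertools.product([3e-5, 3e-4, 3e-3], [3, 20]):
    steps_per_epoch = 1000  # placeholder; dataset-dependent
    total_steps = epochs * steps_per_epoch
    mid_lr = learning_rate(total_steps // 2, total_steps, peak_lr)
    print(f"lr={peak_lr:g}, epochs={epochs}, lr at mid-training={mid_lr:.2e}")
```

Model selection then follows the quoted protocol: for each (learning rate, epochs) configuration, train and keep the run with the best validation-set accuracy.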