Parameter-Efficient Transfer Learning for NLP

Authors: Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, Sylvain Gelly

ICML 2019

Reproducibility

Variable | Result | LLM Response
Research Type | Experimental | "To demonstrate adapters' effectiveness, we transfer the recently proposed BERT Transformer model to 26 diverse text classification tasks, including the GLUE benchmark. Adapters attain near state-of-the-art performance, whilst adding only a few parameters per task. On GLUE, we attain within 0.4% of the performance of full fine-tuning, adding only 3.6% parameters per task."
Researcher Affiliation | Collaboration | "Google Research; Jagiellonian University. Correspondence to: Neil Houlsby <neilhoulsby@google.com>."
Pseudocode | No | The paper includes architectural diagrams but no structured pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide explicit statements or links indicating that the source code for the methodology is openly available.
Open Datasets | Yes | "We first evaluate on GLUE. For these datasets, we transfer from the pre-trained BERT-Large model, which contains 24 layers, and a total of 330M parameters, see Devlin et al. (2018) for details." and "To further validate that adapters yield compact, performant models, we test on additional, publicly available, text classification tasks."
Dataset Splits | Yes | "For each dataset and algorithm, we run a hyperparameter sweep and select the best model according to accuracy on the validation set." and "For GLUE, the validation set accuracy is reported."
Hardware Specification | Yes | "All runs are trained on 4 Google Cloud TPUs with a batch size of 32."
Software Dependencies | No | The paper mentions using the Adam optimizer and TensorFlow Hub, but does not provide specific version numbers for any software dependencies.
Experiment Setup | Yes | "Our training procedure also follows Devlin et al. (2018). We optimize using Adam (Kingma & Ba, 2014), whose learning rate is increased linearly over the first 10% of the steps, and then decayed linearly to zero. All runs are trained on 4 Google Cloud TPUs with a batch size of 32. For each dataset and algorithm, we run a hyperparameter sweep and select the best model according to accuracy on the validation set. We sweep learning rates in {3×10⁻⁵, 3×10⁻⁴, 3×10⁻³}, and number of epochs in {3, 20}."
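The Experiment Setup row describes a concrete schedule (linear warmup over the first 10% of steps, then linear decay to zero) and a small sweep grid. A minimal Python sketch of both is below; the function name, signature, and peak-rate handling are illustrative assumptions, not code from the paper.

```python
import itertools

def linear_warmup_decay(step, total_steps, peak_lr, warmup_frac=0.1):
    """Learning rate at a given step: linear warmup over the first
    `warmup_frac` of training, then linear decay to zero, as quoted
    in the Experiment Setup row. (Hypothetical helper, not from the paper.)"""
    warmup_steps = int(total_steps * warmup_frac)
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr * (total_steps - step) / (total_steps - warmup_steps)

# Sweep grid quoted in the paper: 3 learning rates x 2 epoch counts.
LEARNING_RATES = [3e-5, 3e-4, 3e-3]
EPOCHS = [3, 20]
sweep = list(itertools.product(LEARNING_RATES, EPOCHS))  # 6 configurations
```

For each of the six configurations, the best model would then be selected by validation-set accuracy, per the Dataset Splits row.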