Parameter-Efficient Transfer Learning for NLP
Authors: Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, Sylvain Gelly
ICML 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To demonstrate adapter's effectiveness, we transfer the recently proposed BERT Transformer model to 26 diverse text classification tasks, including the GLUE benchmark. Adapters attain near state-of-the-art performance, whilst adding only a few parameters per task. On GLUE, we attain within 0.4% of the performance of full fine-tuning, adding only 3.6% parameters per task. (A minimal adapter sketch is given after the table.) |
| Researcher Affiliation | Collaboration | ¹Google Research, ²Jagiellonian University. Correspondence to: Neil Houlsby <neilhoulsby@google.com>. |
| Pseudocode | No | The paper includes architectural diagrams but no structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide explicit statements or links indicating that the source code for the methodology is openly available. |
| Open Datasets | Yes | We first evaluate on GLUE. For these datasets, we transfer from the pre-trained BERT-Large model, which contains 24 layers, and a total of 330M parameters, see Devlin et al. (2018) for details. and To further validate that adapters yield compact, performant models, we test on additional, publicly available, text classification tasks. |
| Dataset Splits | Yes | For each dataset and algorithm, we run a hyperparameter sweep and select the best model according to accuracy on the validation set. and For GLUE, the validation set accuracy is reported. |
| Hardware Specification | Yes | All runs are trained on 4 Google Cloud TPUs with a batch size of 32. |
| Software Dependencies | No | The paper mentions using the Adam optimizer and TensorFlow Hub, but does not provide specific version numbers for any software dependencies. |
| Experiment Setup | Yes | Our training procedure also follows Devlin et al. (2018). We optimize using Adam (Kingma & Ba, 2014), whose learning rate is increased linearly over the first 10% of the steps, and then decayed linearly to zero. All runs are trained on 4 Google Cloud TPUs with a batch size of 32. For each dataset and algorithm, we run a hyperparameter sweep and select the best model according to accuracy on the validation set. We sweep learning rates in {3 × 10⁻⁵, 3 × 10⁻⁴, 3 × 10⁻³}, and number of epochs in {3, 20}. (A sketch of this schedule and sweep follows the table.) |
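
The "few parameters per task" quoted in the Research Type row come from bottleneck adapter modules inserted into each Transformer layer and trained while the pre-trained weights stay frozen. The following is a minimal sketch, assuming PyTorch, a BERT-Large hidden size of 1024, and an illustrative bottleneck of 64; the GELU nonlinearity and the near-zero initialization scale are assumptions for illustration, not values taken verbatim from the paper.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter sketch: down-project, nonlinearity, up-project,
    plus a residual connection, in the spirit of Houlsby et al. (2019)."""

    def __init__(self, d_model: int = 1024, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)
        self.up = nn.Linear(bottleneck, d_model)
        self.act = nn.GELU()  # assumed nonlinearity; the paper uses a generic nonlinearity
        # Near-identity initialization so the adapted network starts close to the
        # frozen pre-trained model (small random weights, zero biases).
        nn.init.normal_(self.down.weight, std=1e-3)
        nn.init.zeros_(self.down.bias)
        nn.init.normal_(self.up.weight, std=1e-3)
        nn.init.zeros_(self.up.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Residual connection keeps the layer close to identity at initialization.
        return x + self.up(self.act(self.down(x)))
```

Only the adapter (and task-specific head) parameters would be trained per task, which is how the per-task parameter count stays at a few percent of the full model.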
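
The training procedure quoted in the Experiment Setup row (linear warm-up over the first 10% of steps, then linear decay to zero, with a small grid of learning rates and epoch counts) can be written as a simple schedule. The sketch below is plain Python; the `steps_per_epoch` value is a placeholder, since the number of steps depends on each dataset's size and the batch size of 32.

```python
import itertools

def learning_rate(step: int, total_steps: int, peak_lr: float, warmup_frac: float = 0.1) -> float:
    """Linear warm-up over the first 10% of steps, then linear decay to zero."""
    warmup_steps = max(1, int(warmup_frac * total_steps))
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr * max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

# Sweep grid from the quoted setup: learning rates {3e-5, 3e-4, 3e-3}, epochs {3, 20}.
for peak_lr, epochs in itertools.product([3e-5, 3e-4, 3e-3], [3, 20]):
    steps_per_epoch = 1000  # placeholder; dataset-dependent
    total_steps = epochs * steps_per_epoch
    mid_lr = learning_rate(total_steps // 2, total_steps, peak_lr)
    print(f"lr={peak_lr:g}, epochs={epochs}, lr at mid-training={mid_lr:.2e}")
```

Model selection then follows the quoted protocol: for each (learning rate, epochs) configuration, train and keep the run with the best validation-set accuracy.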