On the Stability of Fine-tuning BERT: Misconceptions, Explanations, and Strong Baselines

Authors: Marius Mosbach, Maksym Andriushchenko, Dietrich Klakow

ICLR 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We analyze BERT, RoBERTa, and ALBERT, fine-tuned on commonly used datasets from the GLUE benchmark, and show that the observed instability is caused by optimization difficulties that lead to vanishing gradients. Additionally, we show that the remaining variance of the downstream task performance can be attributed to differences in generalization, where fine-tuned models with the same training loss exhibit noticeably different test performance.
Researcher Affiliation | Academia | Marius Mosbach, Spoken Language Systems (LSV), Saarland Informatics Campus, Saarland University, mmosbach@lsv.uni-saarland.de; Maksym Andriushchenko, Theory of Machine Learning Lab, École polytechnique fédérale de Lausanne, maksym.andriushchenko@epfl.ch; Dietrich Klakow, Spoken Language Systems (LSV), Saarland Informatics Campus, Saarland University, dietrich.klakow@lsv.uni-saarland.de
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | Yes | Code to reproduce our results is available online: https://github.com/uds-lsv/bert-stable-fine-tuning.
Open Datasets | Yes | We study four datasets from the GLUE benchmark (Wang et al., 2019b) following previous work studying instability during fine-tuning: CoLA, MRPC, RTE, and QNLI. Detailed statistics for each of the datasets can be found in Section 7.2 in the Appendix. All datasets are publicly available. The GLUE datasets can be downloaded here: https://github.com/nyu-mll/jiant. SciTail is available at https://github.com/allenai/scitail.
Dataset Splits | Yes | We follow previous works (Phang et al., 2018; Dodge et al., 2020; Lee et al., 2020) and measure fine-tuning stability using the development sets from the GLUE benchmark. Table 2 (dataset statistics and majority baselines): Training examples: RTE 2491; MRPC 3669; CoLA 8551; QNLI 104744; SciTail 23596. Development examples: RTE 278; MRPC 409; CoLA 1043; QNLI 5464; SciTail 1304.
Hardware Specification | No | The paper does not explicitly describe the specific hardware (e.g., GPU models, CPU types, memory amounts) used for running the experiments.
Software Dependencies | No | The paper states 'Our implementation is based on Hugging Face's transformers library (Wolf et al., 2019)', but does not provide specific version numbers for this library or other software dependencies like Python or PyTorch.
Experiment Setup | Yes | Unless mentioned otherwise, we follow the default fine-tuning strategy recommended by Devlin et al. (2019): we fine-tune uncased BERT-large (henceforth BERT) using a batch size of 16 and a learning rate of 2e-5. The learning rate is linearly increased from 0 to 2e-5 for the first 10% of iterations, which is known as a warmup, and linearly decreased to 0 afterward. We apply dropout with probability p = 0.1 and weight decay with λ = 0.01. We train for 3 epochs on all datasets and use global gradient clipping. Following Devlin et al. (2019), we use the AdamW optimizer (Loshchilov and Hutter, 2019) without bias correction. Table 3: Hyperparameters used for fine-tuning.
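
The default fine-tuning recipe quoted in the Experiment Setup row translates into a short training setup. The following is a minimal sketch, not the authors' released script (see the repository linked under Open Source Code). It assumes an older transformers release that still exposes transformers.AdamW with a correct_bias flag, an illustrative training-set size of 2491 examples (RTE, per the Table 2 statistics above), and a clipping norm of 1.0, which is a common default rather than a value quoted here.

```python
# Hypothetical sketch of the quoted fine-tuning recipe: BERT-large, batch size 16,
# lr 2e-5 with 10% linear warmup then linear decay, weight decay 0.01, 3 epochs,
# global gradient clipping, AdamW without bias correction.
import math
import torch
from transformers import (AdamW, AutoModelForSequenceClassification,
                          get_linear_schedule_with_warmup)

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-large-uncased", num_labels=2)  # dropout p = 0.1 is the config default

num_epochs = 3
batch_size = 16
train_examples = 2491                      # illustrative: RTE training-set size
steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * num_epochs

# AdamW without bias correction, following Devlin et al. (2019); correct_bias
# is only available in older transformers releases.
optimizer = AdamW(model.parameters(), lr=2e-5, weight_decay=0.01, correct_bias=False)

# Linear warmup over the first 10% of iterations, then linear decay to 0.
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.1 * total_steps),
    num_training_steps=total_steps,
)

def training_step(batch):
    """One optimization step: forward, backward, clip, update."""
    model.train()
    outputs = model(**batch)               # batch holds input_ids, attention_mask, labels
    outputs.loss.backward()
    # Global gradient clipping; the max norm of 1.0 is an assumed value.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
    return outputs.loss.item()
```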
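
For the data listed under Open Datasets and Dataset Splits, the paper points to the jiant and SciTail repositories. As a convenience only, and not the authors' data pipeline, the Hugging Face datasets library exposes the same GLUE tasks; the split sizes it reports can be checked against the Table 2 statistics quoted above.

```python
# Load the four GLUE tasks studied in the paper and print train/development sizes.
from datasets import load_dataset

for task in ["rte", "mrpc", "cola", "qnli"]:
    ds = load_dataset("glue", task)
    print(task, len(ds["train"]), len(ds["validation"]))

# SciTail is also mirrored on the Hub, e.g. load_dataset("scitail", "tsv_format"),
# though the paper points to https://github.com/allenai/scitail for the original release.
```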