On the Stability of Fine-tuning BERT: Misconceptions, Explanations, and Strong Baselines

Authors: Marius Mosbach, Maksym Andriushchenko, Dietrich Klakow

ICLR 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We analyze BERT, RoBERTa, and ALBERT, fine-tuned on commonly used datasets from the GLUE benchmark, and show that the observed instability is caused by optimization difficulties that lead to vanishing gradients. Additionally, we show that the remaining variance of the downstream task performance can be attributed to differences in generalization, where fine-tuned models with the same training loss exhibit noticeably different test performance.
Researcher Affiliation | Academia | Marius Mosbach, Spoken Language Systems (LSV), Saarland Informatics Campus, Saarland University, mmosbach@lsv.uni-saarland.de; Maksym Andriushchenko, Theory of Machine Learning Lab, École polytechnique fédérale de Lausanne, maksym.andriushchenko@epfl.ch; Dietrich Klakow, Spoken Language Systems (LSV), Saarland Informatics Campus, Saarland University, dietrich.klakow@lsv.uni-saarland.de
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | Yes | Code to reproduce our results is available online: https://github.com/uds-lsv/bert-stable-fine-tuning.
Open Datasets | Yes | We study four datasets from the GLUE benchmark (Wang et al., 2019b) following previous work studying instability during fine-tuning: CoLA, MRPC, RTE, and QNLI. Detailed statistics for each of the datasets can be found in Section 7.2 in the Appendix. All datasets are publicly available. The GLUE datasets can be downloaded here: https://github.com/nyu-mll/jiant. SciTail is available at https://github.com/allenai/scitail.
Dataset Splits | Yes | We follow previous works (Phang et al., 2018; Dodge et al., 2020; Lee et al., 2020) and measure fine-tuning stability using the development sets from the GLUE benchmark. Table 2 (dataset statistics and majority baselines): Training examples: RTE 2491; MRPC 3669; CoLA 8551; QNLI 104744; SciTail 23596. Development examples: RTE 278; MRPC 409; CoLA 1043; QNLI 5464; SciTail 1304.
Hardware Specification | No | The paper does not explicitly describe the specific hardware (e.g., GPU models, CPU types, memory amounts) used for running the experiments.
Software Dependencies | No | The paper states 'Our implementation is based on Hugging Face's transformers library (Wolf et al., 2019)', but does not provide specific version numbers for this library or other software dependencies like Python or PyTorch.
Experiment Setup | Yes | Unless mentioned otherwise, we follow the default fine-tuning strategy recommended by Devlin et al. (2019): we fine-tune uncased BERT-large (henceforth BERT) using a batch size of 16 and a learning rate of 2e-5. The learning rate is linearly increased from 0 to 2e-5 for the first 10% of iterations, which is known as a warmup, and linearly decreased to 0 afterward. We apply dropout with probability p = 0.1 and weight decay with λ = 0.01. We train for 3 epochs on all datasets and use global gradient clipping. Following Devlin et al. (2019), we use the AdamW optimizer (Loshchilov and Hutter, 2019) without bias correction. Table 3: Hyperparameters used for fine-tuning.
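
The default fine-tuning recipe quoted in the Experiment Setup row translates into a short training setup. The following is a minimal sketch, not the authors' released script (see the repository linked under Open Source Code). It assumes an older transformers release that still exposes transformers.AdamW with a correct_bias flag, an illustrative training-set size of 2491 examples (RTE, per the Table 2 statistics above), and a clipping norm of 1.0, which is a common default rather than a value quoted here.

```python
# Hypothetical sketch of the quoted fine-tuning recipe: BERT-large, batch size 16,
# lr 2e-5 with 10% linear warmup then linear decay, weight decay 0.01, 3 epochs,
# global gradient clipping, AdamW without bias correction.
import math
import torch
from transformers import (AdamW, AutoModelForSequenceClassification,
                          get_linear_schedule_with_warmup)

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-large-uncased", num_labels=2)  # dropout p = 0.1 is the config default

num_epochs = 3
batch_size = 16
train_examples = 2491                      # illustrative: RTE training-set size
steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * num_epochs

# AdamW without bias correction, following Devlin et al. (2019); correct_bias
# is only available in older transformers releases.
optimizer = AdamW(model.parameters(), lr=2e-5, weight_decay=0.01, correct_bias=False)

# Linear warmup over the first 10% of iterations, then linear decay to 0.
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.1 * total_steps),
    num_training_steps=total_steps,
)

def training_step(batch):
    """One optimization step: forward, backward, clip, update."""
    model.train()
    outputs = model(**batch)               # batch holds input_ids, attention_mask, labels
    outputs.loss.backward()
    # Global gradient clipping; the max norm of 1.0 is an assumed value.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
    return outputs.loss.item()
```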
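
For the data listed under Open Datasets and Dataset Splits, the paper points to the jiant and SciTail repositories. As a convenience only, and not the authors' data pipeline, the Hugging Face datasets library exposes the same GLUE tasks; the split sizes it reports can be checked against the Table 2 statistics quoted above.

```python
# Load the four GLUE tasks studied in the paper and print train/development sizes.
from datasets import load_dataset

for task in ["rte", "mrpc", "cola", "qnli"]:
    ds = load_dataset("glue", task)
    print(task, len(ds["train"]), len(ds["validation"]))

# SciTail is also mirrored on the Hub, e.g. load_dataset("scitail", "tsv_format"),
# though the paper points to https://github.com/allenai/scitail for the original release.
```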