On the Stability of Fine-tuning BERT: Misconceptions, Explanations, and Strong Baselines
Authors: Marius Mosbach, Maksym Andriushchenko, Dietrich Klakow
ICLR 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We analyze BERT, RoBERTa, and ALBERT, fine-tuned on commonly used datasets from the GLUE benchmark, and show that the observed instability is caused by optimization difficulties that lead to vanishing gradients. Additionally, we show that the remaining variance of the downstream task performance can be attributed to differences in generalization where fine-tuned models with the same training loss exhibit noticeably different test performance. |
| Researcher Affiliation | Academia | Marius Mosbach Spoken Language Systems (LSV) Saarland Informatics Campus, Saarland University mmosbach@lsv.uni-saarland.de Maksym Andriushchenko Theory of Machine Learning Lab École polytechnique fédérale de Lausanne maksym.andriushchenko@epfl.ch Dietrich Klakow Spoken Language Systems (LSV) Saarland Informatics Campus, Saarland University dietrich.klakow@lsv.uni-saarland.de |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code to reproduce our results is available online: https://github.com/uds-lsv/bert-stable-fine-tuning. |
| Open Datasets | Yes | We study four datasets from the GLUE benchmark (Wang et al., 2019b) following previous work studying instability during fine-tuning: CoLA, MRPC, RTE, and QNLI. Detailed statistics for each of the datasets can be found in Section 7.2 in the Appendix. All datasets are publicly available. The GLUE datasets can be downloaded here: https://github.com/nyu-mll/jiant. SciTail is available at https://github.com/allenai/scitail. |
| Dataset Splits | Yes | We follow previous works (Phang et al., 2018; Dodge et al., 2020; Lee et al., 2020) and measure fine-tuning stability using the development sets from the GLUE benchmark. Table 2 (dataset statistics and majority baselines) reports the split sizes: Training — RTE 2491, MRPC 3669, CoLA 8551, QNLI 104744, SciTail 23596; Development — RTE 278, MRPC 409, CoLA 1043, QNLI 5464, SciTail 1304. (A loading sketch for these splits appears after this table.) |
| Hardware Specification | No | The paper does not explicitly describe the specific hardware (e.g., GPU models, CPU types, memory amounts) used for running the experiments. |
| Software Dependencies | No | The paper states 'Our implementation is based on Hugging Face's transformers library (Wolf et al., 2019)', but does not provide specific version numbers for this library or other software dependencies like Python or PyTorch. |
| Experiment Setup | Yes | Unless mentioned otherwise, we follow the default fine-tuning strategy recommended by Devlin et al. (2019): we fine-tune uncased BERTLARGE (henceforth BERT) using a batch size of 16 and a learning rate of 2e-5. The learning rate is linearly increased from 0 to 2e-5 for the first 10% of iterations, which is known as a warmup, and linearly decreased to 0 afterward. We apply dropout with probability p = 0.1 and weight decay with λ = 0.01. We train for 3 epochs on all datasets and use global gradient clipping. Following Devlin et al. (2019), we use the AdamW optimizer (Loshchilov and Hutter, 2019) without bias correction. Table 3: Hyperparameters used for fine-tuning. (An illustrative configuration sketch appears after this table.) |
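
The GLUE splits cited in the Dataset Splits row can be inspected with a short script. This is a minimal sketch, not the authors' code: it assumes the Hugging Face `datasets` package, whereas the paper points to the jiant and SciTail repositories for downloads; SciTail is omitted here because it is not distributed through GLUE.

```python
# Minimal sketch: load the GLUE tasks studied in the paper and report the
# train/development split sizes used to measure fine-tuning stability.
# Assumes the Hugging Face `datasets` package; not the authors' own tooling.
from datasets import load_dataset

TASKS = ["rte", "mrpc", "cola", "qnli"]  # GLUE task names for RTE, MRPC, CoLA, QNLI

for task in TASKS:
    splits = load_dataset("glue", task)
    print(f"{task}: {len(splits['train'])} training / "
          f"{len(splits['validation'])} development examples")
```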
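
The default fine-tuning recipe quoted in the Experiment Setup row can be wired up roughly as follows. This is a hedged sketch, not the authors' implementation (their repository linked above is the authoritative reference): it assumes `bert-large-uncased`, the standard transformers linear warmup scheduler, and an example dataset size, and it uses the plain PyTorch AdamW, whereas the paper's "without bias correction" detail corresponds to the transformers-era AdamW with `correct_bias=False`.

```python
# Sketch of the described setup: BERT-large uncased, batch size 16, peak
# learning rate 2e-5 with 10% linear warmup then linear decay, dropout 0.1,
# weight decay 0.01, 3 epochs, global gradient clipping.
import torch
from transformers import AutoModelForSequenceClassification, get_linear_schedule_with_warmup

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-large-uncased",
    num_labels=2,
    hidden_dropout_prob=0.1,              # dropout with p = 0.1
    attention_probs_dropout_prob=0.1,
)

num_train_examples = 2491                 # e.g. RTE (see the dataset statistics above)
batch_size = 16
epochs = 3
steps_per_epoch = (num_train_examples + batch_size - 1) // batch_size
total_steps = steps_per_epoch * epochs

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5, weight_decay=0.01)
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.1 * total_steps),  # warmup over the first 10% of iterations
    num_training_steps=total_steps,           # then linear decay to 0
)

# Inside the training loop, after loss.backward():
#   torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # global gradient clipping
#   optimizer.step(); scheduler.step(); optimizer.zero_grad()
```

Note that this sketch applies weight decay to all parameters for brevity; common practice (and the transformers examples) excludes biases and LayerNorm weights from decay.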