IsoBN: Fine-Tuning BERT with Isotropic Batch Normalization
Authors: Wenxuan Zhou, Bill Yuchen Lin, Xiang Ren
AAAI 2021, pp. 14621-14629 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this paper, we analyze the isotropy of the pre-trained [CLS] embeddings of PTLMs with straightforward visualization, and point out two major issues... We also propose a new network regularization method, isotropic batch normalization (IsoBN) to address the issues... This simple yet effective fine-tuning method yields about 1.0 absolute increment on the average of seven NLU tasks. |
| Researcher Affiliation | Academia | Wenxuan Zhou, Bill Yuchen Lin, Xiang Ren Department of Computer Science, University of Southern California, Los Angeles, CA {zhouwenx, yuchen.lin, xiangren}@usc.edu |
| Pseudocode | Yes | The whole algorithm of IsoBN is shown in Algorithm 1. |
| Open Source Code | No | The paper does not provide an explicit statement about releasing source code for the described methodology or a link to a code repository. |
| Open Datasets | Yes | We evaluate IsoBN on two PTLMs (BERT-base-cased and RoBERTa-large) and seven NLU tasks from the GLUE benchmark (Wang et al. 2019b). |
| Dataset Splits | Yes | We apply early stopping according to task-specific metrics on the dev set. We select the best combination of hyperparameters on the dev set. We fine-tune the PTLMs with 5 different random seeds and report the median and standard deviation of metrics on the dev set. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory) used for running its experiments. |
| Software Dependencies | No | Our implementation of PTLMs is based on Hugging Face Transformer (Wolf et al. 2019). The model is fine-tuned with AdamW (Loshchilov and Hutter 2019) optimizer... The paper mentions software tools but does not specify version numbers for them (e.g., 'Hugging Face Transformer' without a version). |
| Experiment Setup | Yes | The model is fine-tuned with AdamW (Loshchilov and Hutter 2019) optimizer using a learning rate in the range of {1e-5, 2e-5, 5e-5} and batch size in {16, 32}. The learning rate is scheduled by a linear warm-up (Goyal et al. 2017) for the first 6% of steps followed by a linear decay to 0. The maximum number of training epochs is set to 10. For IsoBN, the momentum α is set to 0.95, the ϵ is set to 0.1, and the normalization strength β is chosen in the range of {0.25, 0.5, 1}. |
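
The Experiment Setup row above specifies the optimizer and schedule concretely enough to sketch a training configuration. Below is a minimal PyTorch/Hugging Face sketch of one point in the reported grid; the checkpoint name, number of labels, and dataset size are illustrative assumptions, while the AdamW optimizer, the learning-rate and batch-size ranges, the linear warm-up over the first 6% of steps with linear decay to 0, and the 10-epoch cap are taken from the row.

```python
# Sketch of the reported fine-tuning configuration (one grid point).
# Assumptions: checkpoint name, num_labels, and dataset size are illustrative.
import torch
from transformers import AutoModelForSequenceClassification, get_linear_schedule_with_warmup

model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=2)

# Search ranges reported in the paper.
learning_rates = [1e-5, 2e-5, 5e-5]
batch_sizes = [16, 32]
max_epochs = 10

# One concrete grid point (a dataset of 10,000 examples is assumed).
num_examples = 10_000
batch_size = 32
learning_rate = 2e-5
total_steps = (num_examples // batch_size) * max_epochs
warmup_steps = int(0.06 * total_steps)  # linear warm-up over the first 6% of steps

optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=warmup_steps, num_training_steps=total_steps
)  # linear warm-up followed by linear decay to 0
```

The remaining reported choices (early stopping on the dev set, selecting hyperparameters on the dev set, 5 random seeds) would sit in the training loop around this configuration.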
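
The Pseudocode row only notes that the method is given as Algorithm 1, which is not reproduced in this report. As a rough illustration of the idea described above (rescaling [CLS]-embedding dimensions so that high-variance, highly correlated dimensions do not dominate), the sketch below applies a per-dimension scaling factor derived from the standard deviation and mean absolute correlation, tracked with running statistics (momentum α = 0.95), bounded by ϵ = 0.1, and applied with normalization strength β. The specific scaling formula is an assumption made for illustration, not the authors' Algorithm 1.

```python
import torch

def isobn_like_scaling(cls_emb, running_std, running_corr,
                       beta=0.5, alpha=0.95, eps=0.1, training=True):
    """Illustrative IsoBN-style rescaling of [CLS] embeddings, shape (batch, dim).

    NOTE: an approximation of the stated idea, not the paper's Algorithm 1.
    Dimensions with large standard deviation and high average correlation are
    scaled down; beta sets the normalization strength, alpha the momentum of
    the running statistics, and eps keeps the scaling factors bounded.
    """
    if training:
        with torch.no_grad():  # running statistics should not receive gradients
            std = cls_emb.std(dim=0)                      # per-dimension std
            centered = cls_emb - cls_emb.mean(dim=0)
            cov = centered.T @ centered / (cls_emb.size(0) - 1)
            corr = cov / (std.unsqueeze(0) * std.unsqueeze(1) + 1e-8)
            mean_abs_corr = corr.abs().mean(dim=1)        # mean |correlation| per dimension
            running_std.mul_(alpha).add_((1.0 - alpha) * std)
            running_corr.mul_(alpha).add_((1.0 - alpha) * mean_abs_corr)
    # high-variance, highly correlated dimensions receive small scaling factors
    gamma = 1.0 / (running_std * running_corr + eps)
    gamma = gamma / gamma.max()
    return cls_emb * gamma.pow(beta)

# Usage with an assumed embedding size of 768 (BERT-base) and batch size 32.
dim = 768
running_std, running_corr = torch.ones(dim), torch.ones(dim)
scaled = isobn_like_scaling(torch.randn(32, dim), running_std, running_corr, beta=0.5)
```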