IsoBN: Fine-Tuning BERT with Isotropic Batch Normalization

Authors: Wenxuan Zhou, Bill Yuchen Lin, Xiang Ren

AAAI 2021, pp. 14621-14629 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this paper, we analyze the isotropy of the pre-trained [CLS] embeddings of PTLMs with straightforward visualization, and point out two major issues... We also propose a new network regularization method, isotropic batch normalization (IsoBN) to address the issues... This simple yet effective fine-tuning method yields about 1.0 absolute increment on the average of seven NLU tasks.
Researcher Affiliation | Academia | Wenxuan Zhou, Bill Yuchen Lin, Xiang Ren; Department of Computer Science, University of Southern California, Los Angeles, CA; {zhouwenx, yuchen.lin, xiangren}@usc.edu
Pseudocode | Yes | The whole algorithm of IsoBN is shown in Algorithm 1.
Open Source Code | No | The paper does not provide an explicit statement about releasing source code for the described methodology or a link to a code repository.
Open Datasets | Yes | We evaluate IsoBN on two PTLMs (BERT-base-cased and RoBERTa-large) and seven NLU tasks from the GLUE benchmark (Wang et al. 2019b).
Dataset Splits | Yes | We apply early stopping according to task-specific metrics on the dev set. We select the best combination of hyperparameters on the dev set. We fine-tune the PTLMs with 5 different random seeds and report the median and standard deviation of metrics on the dev set.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory) used for running its experiments.
Software Dependencies | No | Our implementation of PTLMs is based on Hugging Face Transformers (Wolf et al. 2019). The model is fine-tuned with the AdamW (Loshchilov and Hutter 2019) optimizer... The paper mentions software tools but does not specify version numbers for them (e.g., "Hugging Face Transformers" without a version).
Experiment Setup | Yes | The model is fine-tuned with the AdamW (Loshchilov and Hutter 2019) optimizer using a learning rate chosen from {1e-5, 2e-5, 5e-5} and a batch size from {16, 32}. The learning rate is scheduled by a linear warm-up (Goyal et al. 2017) for the first 6% of steps, followed by a linear decay to 0. The maximum number of training epochs is set to 10. For IsoBN, the momentum α is set to 0.95, ε is set to 0.1, and the normalization strength β is chosen from {0.25, 0.5, 1}.
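
The Pseudocode row above notes that the full method is given as Algorithm 1 in the paper, but that algorithm is not reproduced in this record. Purely as an illustrative sketch of the idea described in the abstract (down-weighting [CLS] dimensions with dominating variance or high correlation to other dimensions, using running statistics with momentum α, stabilizer ε, and normalization strength β), a PyTorch layer might look like the following. The class name IsoBNSketch and the exact scaling rule are assumptions made for illustration, not the authors' Algorithm 1.

```python
import torch
import torch.nn as nn


class IsoBNSketch(nn.Module):
    """Illustrative sketch only: down-weights [CLS] dimensions with dominating
    variance or high correlation to other dimensions, using running statistics.
    This is a guess at the spirit of the method, not the paper's Algorithm 1."""

    def __init__(self, hidden_size, momentum=0.95, eps=0.1, beta=0.5):
        super().__init__()
        self.momentum = momentum  # alpha in the paper's hyperparameter list
        self.eps = eps
        self.beta = beta          # normalization strength
        self.register_buffer("running_std", torch.ones(hidden_size))
        self.register_buffer("running_corr", torch.eye(hidden_size))

    def forward(self, cls_emb):
        # cls_emb: (batch_size, hidden_size) [CLS] embeddings.
        if self.training:
            with torch.no_grad():
                std = cls_emb.std(dim=0, unbiased=False)
                centered = (cls_emb - cls_emb.mean(dim=0)) / (std + self.eps)
                corr = centered.t() @ centered / cls_emb.size(0)
                # Exponential moving average of the statistics (momentum = alpha).
                self.running_std.mul_(self.momentum).add_((1 - self.momentum) * std)
                self.running_corr.mul_(self.momentum).add_((1 - self.momentum) * corr)
        # Mean absolute correlation of each dimension with all dimensions.
        mean_abs_corr = self.running_corr.abs().mean(dim=1)
        # Scale down high-variance, highly correlated dimensions; beta controls
        # how strongly the normalization is applied.
        scale = 1.0 / (self.running_std * mean_abs_corr + self.eps)
        scale = scale / scale.max()
        return cls_emb * scale.pow(self.beta)
```

Such a layer would presumably sit between the [CLS] embedding and the classification head during fine-tuning; the actual update rule should be taken from Algorithm 1 in the paper.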
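
The Experiment Setup row maps directly onto a standard Hugging Face / PyTorch fine-tuning configuration. A minimal sketch, assuming AutoModelForSequenceClassification as the task wrapper and a placeholder training-set size (both assumptions; the record does not specify them):

```python
import torch
from transformers import AutoModelForSequenceClassification, get_linear_schedule_with_warmup

# Hyperparameters reported in the Experiment Setup row above.
learning_rate = 2e-5        # searched over {1e-5, 2e-5, 5e-5}
batch_size = 32             # searched over {16, 32}
max_epochs = 10
warmup_fraction = 0.06      # linear warm-up for the first 6% of steps, then linear decay to 0
isobn_momentum = 0.95       # alpha
isobn_eps = 0.1
isobn_beta = 0.5            # normalization strength, searched over {0.25, 0.5, 1}
num_train_examples = 10000  # placeholder; depends on the GLUE task

# The task wrapper is an assumption; the paper only states it builds on Hugging Face Transformers.
model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=2)
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

steps_per_epoch = (num_train_examples + batch_size - 1) // batch_size
total_steps = steps_per_epoch * max_epochs
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(warmup_fraction * total_steps),
    num_training_steps=total_steps,  # decays linearly to 0 after warm-up
)
```

Per the Dataset Splits row, each selected configuration would then be run with 5 random seeds, with early stopping on the task-specific dev metric, and the median and standard deviation of dev scores reported.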