Mixout: Effective Regularization to Finetune Large-scale Pretrained Language Models

Authors: Cheolhyoung Lee, Kyunghyun Cho, Wanmo Kang

ICLR 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We empirically evaluate the proposed mixout and its variants on finetuning a pretrained language model on downstream tasks.
Researcher Affiliation | Collaboration | Cheolhyoung Lee (cheolhyoung.lee@kaist.ac.kr) and Wanmo Kang (wanmo.kang@kaist.ac.kr), Department of Mathematical Sciences, KAIST, Daejeon 34141, Republic of Korea; Kyunghyun Cho (kyunghyun.cho@nyu.edu), New York University, Facebook AI Research, CIFAR Azrieli Global Scholar.
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | No | The paper refers to the publicly available implementation of BERT by Hugging Face, a third-party tool used in the experiments, but does not provide an explicit statement or link for the authors' own mixout implementation (a hedged sketch of mixout appears after this table).
Open Datasets | Yes | To validate our theoretical findings, we train a fully connected network on EMNIST Digits (Cohen et al., 2017) and finetune it on MNIST. In order to experimentally validate the effectiveness of mixout, we finetune BERT-LARGE on a subset of GLUE (Wang et al., 2018) tasks (RTE, MRPC, CoLA, and STS-B) with mixout(w_pre). SST-2 (67,000 training examples): binary sentiment classification (Socher et al., 2013).
Dataset Splits | Yes | We use the 240,000 characters provided for training and split these into a training set (216,000 characters) and a validation set (24,000 characters). For finetuning, we train our model on MNIST, which has 70,000 characters in 10 balanced classes. MNIST provides 60,000 characters for training and 10,000 characters for test. We use the 60,000 characters given for training and split these into a training set (54,000 characters) and a validation set (6,000 characters). (A sketch of these splits appears after this table.)
Hardware Specification | No | The paper does not provide specific hardware details such as exact GPU/CPU models or processor types used for its experiments.
Software Dependencies | No | The paper mentions "PyTorch by Hugging Face" but does not provide specific version numbers for PyTorch or the Hugging Face library, nor any other software dependencies with versions.
Experiment Setup | Yes | We use Adam (Kingma & Ba, 2014) with a learning rate of 10^-4, β1 = 0.9, β2 = 0.999, wdecay(0, 0.01), learning rate warm-up over the first 10% of the total steps, and linear decay of the learning rate after the warm-up. We use dropout(0.1) for all layers except the input and output layers. We train with a batch size of 32 for 3 training epochs. Since the pretrained BERT-LARGE is the sentence encoder, we have to create an additional output layer, which is not pretrained. We initialize each of its parameters with N(0, 0.02^2). (A sketch of this setup appears after this table.)
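
Since the authors' mixout code is not linked from the paper (see the Open Source Code row), the following is a minimal sketch of a mixout(w_pre, p)-regularized linear layer based on the formulation the paper describes: each weight is swapped for its pretrained value with probability p, then rescaled so the expected effective weight equals the current weight, mirroring inverted dropout. The class name MixoutLinear and its constructor are our own illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixoutLinear(nn.Linear):
    """Sketch of a linear layer regularized with mixout(w_pre, p):
    each weight is replaced by its pretrained value with probability p,
    then rescaled so the expected effective weight equals the current weight."""

    def __init__(self, in_features, out_features, pretrained_weight, p=0.7, bias=True):
        super().__init__(in_features, out_features, bias=bias)
        # Frozen copy of the pretrained weights w_pre used as the mixing target.
        self.register_buffer("w_pre", pretrained_weight.detach().clone())
        assert 0.0 <= p < 1.0
        self.p = p

    def forward(self, x):
        weight = self.weight
        if self.training and self.p > 0:
            # mask == 1 means "use the pretrained value for this weight".
            mask = torch.empty_like(weight).bernoulli_(self.p)
            mixed = (1.0 - mask) * weight + mask * self.w_pre
            # Subtract p * w_pre and rescale so that E[effective weight] = weight.
            weight = (mixed - self.p * self.w_pre) / (1.0 - self.p)
        return F.linear(x, weight, self.bias)

# Illustrative usage: finetune from a pretrained layer with mixout probability 0.7.
pretrained = nn.Linear(1024, 1024)
layer = MixoutLinear(1024, 1024, pretrained.weight, p=0.7)
layer.weight.data.copy_(pretrained.weight.data)   # start finetuning at w_pre
layer.bias.data.copy_(pretrained.bias.data)
```

At evaluation time the layer falls back to a plain linear map with the current weights, so mixout only perturbs training, just as dropout does.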
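The EMNIST Digits and MNIST splits quoted in the Dataset Splits row can be reproduced with standard tooling; the sketch below uses torchvision and a random split, where the data root, seed, and transform are our placeholders and are not specified in the paper.

```python
import torch
from torch.utils.data import random_split
from torchvision import datasets, transforms

tfm = transforms.ToTensor()
gen = torch.Generator().manual_seed(0)   # seed is ours, not taken from the paper

# EMNIST Digits: 240,000 training characters -> 216,000 train / 24,000 validation.
emnist = datasets.EMNIST("./data", split="digits", train=True, download=True, transform=tfm)
emnist_train, emnist_val = random_split(emnist, [216_000, 24_000], generator=gen)

# MNIST: 60,000 training characters -> 54,000 train / 6,000 validation
# (the official 10,000-character test split is kept for evaluation).
mnist = datasets.MNIST("./data", train=True, download=True, transform=tfm)
mnist_train, mnist_val = random_split(mnist, [54_000, 6_000], generator=gen)
```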
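The Experiment Setup row describes the optimizer, the warm-up/linear-decay schedule, and the output-layer initialization. A plain-PyTorch sketch of that configuration follows; the toy encoder, output layer, and steps_per_epoch are placeholders, and using Adam's coupled weight_decay to stand in for wdecay(0, 0.01) (L2 decay toward zero) is our assumption.

```python
import torch
import torch.nn as nn
from torch.optim import Adam
from torch.optim.lr_scheduler import LambdaLR

# Toy stand-ins: 'encoder' plays the role of the pretrained sentence encoder and
# 'output_layer' the additional, non-pretrained task head.
encoder = nn.Linear(1024, 1024)
output_layer = nn.Linear(1024, 2)
nn.init.normal_(output_layer.weight, mean=0.0, std=0.02)   # N(0, 0.02^2) as quoted
nn.init.zeros_(output_layer.bias)
model = nn.Sequential(encoder, nn.Dropout(0.1), output_layer)  # dropout(0.1) on the hidden representation

# Schedule: warm-up over the first 10% of the total steps, then linear decay to zero.
batch_size, num_epochs, steps_per_epoch = 32, 3, 100   # steps_per_epoch is illustrative
total_steps = num_epochs * steps_per_epoch
warmup_steps = int(0.1 * total_steps)

# Learning rate, betas, and weight decay follow the quoted excerpt.
optimizer = Adam(model.parameters(), lr=1e-4, betas=(0.9, 0.999), weight_decay=0.01)

def lr_lambda(step):
    """Linear warm-up followed by linear decay of the learning rate."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

scheduler = LambdaLR(optimizer, lr_lambda)
```

Calling scheduler.step() once per optimization step (not per epoch) reproduces the per-step warm-up and decay described in the excerpt.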