SAS: Self-Augmentation Strategy for Language Model Pre-training

Authors: Yifei Xu, Jingqiao Zhang, Ru He, Liangzhu Ge, Chao Yang, Cheng Yang, Ying Nian Wu

AAAI 2022, pp. 11586-11594 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments show that SAS outperforms ELECTRA and other state-of-the-art models in the GLUE tasks with similar or less computation cost.
Researcher Affiliation | Collaboration | Yifei Xu (1*), Jingqiao Zhang (2*), Ru He (2*), Liangzhu Ge (2*), Chao Yang (2), Cheng Yang (2), Ying Nian Wu (1); 1: University of California, Los Angeles; 2: Alibaba Group
Pseudocode | No | The paper describes the SAS framework and its workflow but does not include structured pseudocode or algorithm blocks.
Open Source Code | Yes | Code and a pretrained model are publicly available on GitHub: https://github.com/alibaba/self-augmentation-strategy.
Open Datasets | Yes | We use the same pretraining data as BERT, ELECTRA-Small and ELECTRA-Base, which consists of 3.3 billion tokens from the Wikipedia and Books Corpus datasets.
Dataset Splits | Yes | All GLUE scores are based on the Dev dataset.
Hardware Specification | Yes | With 1 V100 GPU, pre-training of SASDA-Small takes 37.5h; both SAS-Small and SASc-Small take about 24h; and ELECTRA-Small takes about 35h. Pre-training takes 7.7 days on 8 V100 GPUs.
Software Dependencies | Yes | Our implementation is based on the Huggingface Transformers 4.3 framework (Wolf et al. 2020).
Experiment Setup | Yes | For the ELECTRA-Small model as well as all other small models, we use batch size 512 and 0.25M pre-training steps, instead of the batch size 128 and 1M steps used in Clark et al. (2020b), and double the learning rate accordingly.
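
To make the reported setup concrete, below is a minimal sketch of how the small-model pre-training configuration from the table (batch size 512, 0.25M steps, doubled learning rate) might be expressed with the Huggingface Transformers TrainingArguments API that the authors' implementation builds on. This is not the authors' released configuration: the output directory, base learning-rate value, gradient-accumulation split, warmup, weight decay, and logging/saving intervals are illustrative assumptions.

```python
# Hypothetical sketch of the small-model pre-training configuration described
# in the table above, using the Huggingface Transformers API (>= 4.3) that the
# paper's implementation is based on. Only the effective batch size (512) and
# the step count (0.25M) come from the paper; all other values are assumptions.
from transformers import TrainingArguments

BASE_LR = 5e-4            # assumed ELECTRA-Small-style base learning rate (illustrative)
DOUBLED_LR = 2 * BASE_LR  # the paper doubles the learning rate when moving to batch size 512

training_args = TrainingArguments(
    output_dir="./sas-small-pretrain",  # hypothetical output path
    per_device_train_batch_size=64,     # 64 x 8 accumulation steps = effective batch size 512
    gradient_accumulation_steps=8,      # assumed split for a single V100 GPU
    max_steps=250_000,                  # 0.25M pre-training steps, as reported
    learning_rate=DOUBLED_LR,
    warmup_steps=10_000,                # illustrative warmup; not specified in the table
    weight_decay=0.01,                  # illustrative value
    logging_steps=1_000,
    save_steps=50_000,
)
```

These arguments would then be passed to a Trainer together with the SAS model and the Wikipedia/Books Corpus pre-training data; the exact model and data-collator classes are defined in the authors' repository and are not reproduced here.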