SAS: Self-Augmentation Strategy for Language Model Pre-training
Authors: Yifei Xu, Jingqiao Zhang, Ru He, Liangzhu Ge, Chao Yang, Cheng Yang, Ying Nian Wu
AAAI 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments show that SAS outperforms ELECTRA and other state-of-the-art models in the GLUE tasks with similar or less computation cost. |
| Researcher Affiliation | Collaboration | Yifei Xu (1*), Jingqiao Zhang (2*), Ru He (2*), Liangzhu Ge (2*), Chao Yang (2), Cheng Yang (2), Ying Nian Wu (1). Affiliations: 1 University of California, Los Angeles; 2 Alibaba Group. |
| Pseudocode | No | The paper describes the SAS framework and its workflow but does not include structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code and pretrained model are available publicly on GitHub: https://github.com/alibaba/self-augmentation-strategy. |
| Open Datasets | Yes | We use the same pretraining data as BERT, ELECTRA-Small and ELECTRA-Base, which consists of 3.3 billion tokens from the Wikipedia and Books Corpus datasets. (An illustrative data-loading sketch follows the table.) |
| Dataset Splits | Yes | All GLUE scores are based on the Dev dataset. |
| Hardware Specification | Yes | With 1 V100 GPU, the pre-training of SAS_DA-Small takes 37.5h; both SAS-Small and SAS_c-Small take about 24h; and ELECTRA-Small takes about 35h. The pre-training takes 7.7 days on 8 V100 GPUs. |
| Software Dependencies | Yes | Our implementation is based on the Huggingface Transformers 4.3 framework (Wolf et al. 2020). |
| Experiment Setup | Yes | For the ELECTRA-Small model, as well as all other small models, we use batch size 512 and 0.25M pre-training steps, instead of batch size 128 and 1M steps as in Clark et al. (2020b), and double the learning rate accordingly. (An illustrative hyperparameter sketch follows the table.) |
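
The pretraining corpora and GLUE Dev split cited in the table are all publicly distributed. As a minimal, hedged sketch (not the authors' data pipeline), they could be pulled with the Huggingface `datasets` library; the Wikipedia snapshot name and the choice of SST-2 as the GLUE example are assumptions, not details quoted above.

```python
# Minimal sketch, not the paper's data pipeline: fetch the corpora named in
# the table with the Huggingface `datasets` library.
from datasets import load_dataset

# English Wikipedia; the "20200501.en" snapshot is an assumption, the paper
# does not name a specific dump in the quoted text.
wiki = load_dataset("wikipedia", "20200501.en", split="train")

# BooksCorpus as distributed on the Huggingface hub.
books = load_dataset("bookcorpus", split="train")

# GLUE scores in the table are reported on the Dev (validation) split;
# SST-2 is shown here only as an example task.
sst2_dev = load_dataset("glue", "sst2", split="validation")

print(len(wiki), len(books), len(sst2_dev))
```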
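
For the small-model setup quoted in the Experiment Setup row (batch size 512, 0.25M steps, doubled learning rate) on the Huggingface Transformers 4.3 framework, the hyperparameters could be expressed roughly as below. This is a sketch under stated assumptions, not the authors' training script: the learning-rate value, warmup, weight decay, and gradient-accumulation split are assumptions, and the actual entry point lives in the GitHub repository linked above.

```python
# Illustrative sketch only: mapping the reported small-model hyperparameters
# onto transformers.TrainingArguments. Values marked "assumed" are not
# stated in the table.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="sas-small-pretrain",   # hypothetical output path
    per_device_train_batch_size=64,    # 64 x 8 accumulation steps = 512 effective batch (1 V100)
    gradient_accumulation_steps=8,
    max_steps=250_000,                 # 0.25M pre-training steps
    learning_rate=1e-3,                # assumed: 2x the ELECTRA-Small default of 5e-4
    warmup_steps=10_000,               # assumed
    weight_decay=0.01,                 # assumed
    logging_steps=1_000,
    save_steps=25_000,
)
```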