Taking Notes on the Fly Helps Language Pre-Training

Authors: Qiyu Wu, Chen Xing, Yatao Li, Guolin Ke, Di He, Tie-Yan Liu

ICLR 2021

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | We implement TNF on both BERT and ELECTRA to check its efficiency and effectiveness. Experimental results show that TNF's training time is 60% less than its backbone pre-training models when reaching the same performance. |
| Researcher Affiliation | Collaboration | Peking University; College of Computer Science, Nankai University; Microsoft Research |
| Pseudocode | No | The paper describes the method in text and diagrams but does not provide pseudocode or algorithm blocks. |
| Open Source Code | Yes | Source code is attached in the supplementary material. |
| Open Datasets | Yes | Following BERT (Devlin et al., 2018), we use the English Wikipedia corpus and Book Corpus (Zhu et al., 2015) for pre-training. |
| Dataset Splits | Yes | Each configuration is run five times with different random seeds, and the average of these five results on the validation set is calculated as the final performance of one configuration. |
| Hardware Specification | Yes | All models are run on 16 NVIDIA Tesla V100 GPUs with mixed-precision (Micikevicius et al., 2017). |
| Software Dependencies | No | The paper states 'All codes are implemented based on fairseq (Ott et al., 2019) in PyTorch (Paszke et al., 2017)' but does not provide specific version numbers for these software dependencies. |
| Experiment Setup | Yes | All models are pre-trained for 1000k steps with batch size 256 and maximum sequence length 512. We use Adam (Kingma & Ba, 2014) as the optimizer, and set its hyperparameter ϵ to 1e-6 and (β1, β2) to (0.9, 0.98). The peak learning rate is set to 1e-4 with a 10k-step warm-up stage. After the warm-up stage, the learning rate decays linearly to zero. We set the dropout probability to 0.1 and weight decay to 0.01. There are three additional hyper-parameters for TNF: half window size k, discount factor λ, and weight γ. We set k to 16, λ to 0.5, and γ to 0.1 for the main experiment; for ELECTRA, k is empirically set to 32. |
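The reported experiment setup maps onto a standard PyTorch optimizer and learning-rate schedule. The sketch below is a minimal illustration of that configuration, assuming a placeholder module in place of the BERT/ELECTRA backbone and a hand-written `lr_lambda` helper; the authors' actual implementation is in fairseq, so all names here are illustrative, not their code.

```python
# Illustrative sketch of the reported pre-training optimization setup.
# Placeholder model; the paper uses a Transformer encoder (BERT/ELECTRA) with dropout 0.1,
# batch size 256 and maximum sequence length 512.
import torch
from torch.optim import Adam
from torch.optim.lr_scheduler import LambdaLR

TOTAL_STEPS = 1_000_000   # "pre-trained for 1000k steps"
WARMUP_STEPS = 10_000     # 10k-step warm-up stage
PEAK_LR = 1e-4

# TNF-specific hyper-parameters as reported (k = 32 for the ELECTRA backbone).
tnf_config = {"half_window_size_k": 16, "discount_lambda": 0.5, "weight_gamma": 0.1}

model = torch.nn.Linear(768, 768)  # stand-in for the actual pre-training model

optimizer = Adam(
    model.parameters(),
    lr=PEAK_LR,
    betas=(0.9, 0.98),
    eps=1e-6,
    weight_decay=0.01,
)

def lr_lambda(step: int) -> float:
    """Linear warm-up to the peak learning rate, then linear decay to zero."""
    if step < WARMUP_STEPS:
        return step / max(1, WARMUP_STEPS)
    return max(0.0, (TOTAL_STEPS - step) / max(1, TOTAL_STEPS - WARMUP_STEPS))

scheduler = LambdaLR(optimizer, lr_lambda)
```

In a training loop, `scheduler.step()` would be called once per optimizer step so that the learning rate follows the warm-up-then-linear-decay shape described in the paper.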
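The hardware row reports mixed-precision training on 16 V100 GPUs. As a rough illustration of how mixed precision is commonly enabled in PyTorch, the sketch below uses `torch.cuda.amp` on a single GPU with random placeholder data; it is not the authors' fairseq/distributed setup and assumes a CUDA device is available.

```python
# Minimal single-GPU mixed-precision loop with torch.cuda.amp (illustrative only).
import torch

model = torch.nn.Linear(768, 2).cuda()                      # placeholder module
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()

for _ in range(3):                                           # stand-in for the data loader
    inputs = torch.randn(256, 768, device="cuda")
    targets = torch.randint(0, 2, (256,), device="cuda")

    optimizer.zero_grad()
    with torch.cuda.amp.autocast():                          # forward pass in fp16 where safe
        loss = torch.nn.functional.cross_entropy(model(inputs), targets)
    scaler.scale(loss).backward()                            # scale loss to avoid fp16 underflow
    scaler.step(optimizer)
    scaler.update()
```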