Taking Notes on the Fly Helps Language Pre-Training
Authors: Qiyu Wu, Chen Xing, Yatao Li, Guolin Ke, Di He, Tie-Yan Liu
ICLR 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We implement TNF on both BERT and ELECTRA to check its efficiency and effectiveness. Experimental results show that TNF's training time is 60% less than its backbone pre-training models when reaching the same performance. |
| Researcher Affiliation | Collaboration | Peking University; College of Computer Science, Nankai University; Microsoft Research |
| Pseudocode | No | The paper describes the method in text and diagrams but does not provide pseudocode or algorithm blocks. (A hedged sketch of the mechanism is given after this table.) |
| Open Source Code | Yes | Source code is attached in the supplementary material. |
| Open Datasets | Yes | Following BERT (Devlin et al., 2018), we use the English Wikipedia corpus and BookCorpus (Zhu et al., 2015) for pre-training. |
| Dataset Splits | Yes | Each configuration is run five times with different random seeds, and the average of these five results on the validation set is calculated as the final performance of one configuration. |
| Hardware Specification | Yes | All models are run on 16 NVIDIA Tesla V100 GPUs with mixed-precision (Micikevicius et al., 2017). |
| Software Dependencies | No | The paper states 'All codes are implemented based on fairseq (Ott et al., 2019) in PyTorch (Paszke et al., 2017)' but does not provide specific version numbers for these software dependencies. |
| Experiment Setup | Yes | All models are pre-trained for 1000k steps with batch size 256 and maximum sequence length 512. We use Adam (Kingma & Ba, 2014) as the optimizer, and set its hyperparameter ϵ to 1e-6 and (β1, β2) to (0.9, 0.98). The peak learning rate is set to 1e-4 with a 10k-step warm-up stage. After the warm-up stage, the learning rate decays linearly to zero. We set the dropout probability to 0.1 and weight decay to 0.01. There are three additional hyperparameters for TNF: half window size k, discount factor λ, and weight γ. We set k to 16, λ to 0.5, and γ to 0.1 for the main experiments, except that for ELECTRA, k is empirically set to 32. (A hedged configuration sketch is given after this table.) |
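Since the paper provides no pseudocode, the note-taking mechanism can be illustrated with a minimal PyTorch sketch reconstructed from the paper's textual description. This is not the authors' released code: the class name `NoteDictionary`, the mean-pooling over a window of half size k, the EMA update with discount factor λ, and the γ-weighted mixing into input embeddings are all inferred from the hyperparameter descriptions above, and bookkeeping details (e.g. multiple occurrences of a rare word per batch) are omitted.

```python
import torch

class NoteDictionary:
    """Hedged sketch of TNF's note dictionary: one note vector per rare
    word, updated as an exponential moving average (discount factor λ)
    of mean-pooled contextual representations around each occurrence.
    Names and pooling details are assumptions, not the authors' code."""

    def __init__(self, rare_vocab_size: int, hidden_dim: int, discount: float = 0.5):
        self.notes = torch.zeros(rare_vocab_size, hidden_dim)
        self.discount = discount  # λ (0.5 in the main experiments)

    def update(self, rare_word_id: int, hidden_states: torch.Tensor,
               position: int, half_window: int = 16) -> None:
        # Mean-pool contextual representations in a window of half size k
        # around the rare word's position (k = 16, or 32 for ELECTRA).
        lo = max(0, position - half_window)
        hi = min(hidden_states.size(0), position + half_window + 1)
        context = hidden_states[lo:hi].mean(dim=0)
        # EMA update of the note with discount factor λ.
        self.notes[rare_word_id] = (
            (1 - self.discount) * self.notes[rare_word_id]
            + self.discount * context
        )

    def lookup(self, rare_word_id: int) -> torch.Tensor:
        return self.notes[rare_word_id]


def mix_note_into_embedding(word_emb: torch.Tensor, note: torch.Tensor,
                            gamma: float = 0.1) -> torch.Tensor:
    # Assumed convex combination: the rare word's input representation
    # mixes its token embedding with its note, weighted by γ (0.1).
    return (1 - gamma) * word_emb + gamma * note
```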
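The Experiment Setup row likewise maps onto a concrete configuration. Below is a hedged plain-PyTorch sketch of the reported optimization recipe (Adam with ϵ = 1e-6 and (β1, β2) = (0.9, 0.98), peak learning rate 1e-4, 10k-step linear warm-up, linear decay to zero over 1000k steps, weight decay 0.01). The authors actually trained with fairseq, so this is illustrative rather than their training script, and the placeholder model stands in for the BERT/ELECTRA backbone.

```python
import torch

# Placeholder for the BERT- or ELECTRA-style backbone (assumption: any
# torch.nn.Module works here; dropout 0.1 would be set inside the model).
model = torch.nn.Linear(768, 768)

TOTAL_STEPS = 1_000_000   # "1000k steps"
WARMUP_STEPS = 10_000     # 10k-step warm-up stage
PEAK_LR = 1e-4

optimizer = torch.optim.Adam(
    model.parameters(),
    lr=PEAK_LR,
    betas=(0.9, 0.98),
    eps=1e-6,
    weight_decay=0.01,
)

def lr_lambda(step: int) -> float:
    # Linear warm-up to the peak learning rate, then linear decay to zero.
    if step < WARMUP_STEPS:
        return step / WARMUP_STEPS
    return max(0.0, (TOTAL_STEPS - step) / (TOTAL_STEPS - WARMUP_STEPS))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
```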