Taking Notes on the Fly Helps Language Pre-Training

Authors: Qiyu Wu, Chen Xing, Yatao Li, Guolin Ke, Di He, Tie-Yan Liu

ICLR 2021

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | We implement TNF on both BERT and ELECTRA to check its efficiency and effectiveness. Experimental results show that TNF's training time is 60% less than its backbone pre-training models when reaching the same performance. |
| Researcher Affiliation | Collaboration | Peking University; College of Computer Science, Nankai University; Microsoft Research |
| Pseudocode | No | The paper describes the method in text and diagrams but does not provide pseudocode or algorithm blocks. |
| Open Source Code | Yes | Source code is attached in the supplementary material. |
| Open Datasets | Yes | Following BERT (Devlin et al., 2018), we use the English Wikipedia corpus and Book Corpus (Zhu et al., 2015) for pre-training. |
| Dataset Splits | Yes | Each configuration is run five times with different random seeds, and the average of these five results on the validation set is calculated as the final performance of one configuration. |
| Hardware Specification | Yes | All models are run on 16 NVIDIA Tesla V100 GPUs with mixed-precision (Micikevicius et al., 2017). |
| Software Dependencies | No | The paper states 'All codes are implemented based on fairseq (Ott et al., 2019) in PyTorch (Paszke et al., 2017)' but does not provide specific version numbers for these software dependencies. |
| Experiment Setup | Yes | All models are pre-trained for 1000k steps with batch size 256 and maximum sequence length 512. We use Adam (Kingma & Ba, 2014) as the optimizer, and set its hyperparameter ϵ to 1e-6 and (β1, β2) to (0.9, 0.98). The peak learning rate is set to 1e-4 with a 10k-step warm-up stage. After the warm-up stage, the learning rate decays linearly to zero. We set the dropout probability to 0.1 and weight decay to 0.01. There are three additional hyper-parameters for TNF: half window size k, discount factor λ, and weight γ. We set k to 16, λ to 0.5, and γ to 0.1 for the main experiment; for ELECTRA, k is empirically set to 32. |
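The reported experiment setup maps onto a standard PyTorch optimizer and learning-rate schedule. The sketch below is a minimal illustration of that configuration, assuming a placeholder module in place of the BERT/ELECTRA backbone and a hand-written `lr_lambda` helper; the authors' actual implementation is in fairseq, so all names here are illustrative, not their code.

```python
# Illustrative sketch of the reported pre-training optimization setup.
# Placeholder model; the paper uses a Transformer encoder (BERT/ELECTRA) with dropout 0.1,
# batch size 256 and maximum sequence length 512.
import torch
from torch.optim import Adam
from torch.optim.lr_scheduler import LambdaLR

TOTAL_STEPS = 1_000_000   # "pre-trained for 1000k steps"
WARMUP_STEPS = 10_000     # 10k-step warm-up stage
PEAK_LR = 1e-4

# TNF-specific hyper-parameters as reported (k = 32 for the ELECTRA backbone).
tnf_config = {"half_window_size_k": 16, "discount_lambda": 0.5, "weight_gamma": 0.1}

model = torch.nn.Linear(768, 768)  # stand-in for the actual pre-training model

optimizer = Adam(
    model.parameters(),
    lr=PEAK_LR,
    betas=(0.9, 0.98),
    eps=1e-6,
    weight_decay=0.01,
)

def lr_lambda(step: int) -> float:
    """Linear warm-up to the peak learning rate, then linear decay to zero."""
    if step < WARMUP_STEPS:
        return step / max(1, WARMUP_STEPS)
    return max(0.0, (TOTAL_STEPS - step) / max(1, TOTAL_STEPS - WARMUP_STEPS))

scheduler = LambdaLR(optimizer, lr_lambda)
```

In a training loop, `scheduler.step()` would be called once per optimizer step so that the learning rate follows the warm-up-then-linear-decay shape described in the paper.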
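The hardware row reports mixed-precision training on 16 V100 GPUs. As a rough illustration of how mixed precision is commonly enabled in PyTorch, the sketch below uses `torch.cuda.amp` on a single GPU with random placeholder data; it is not the authors' fairseq/distributed setup and assumes a CUDA device is available.

```python
# Minimal single-GPU mixed-precision loop with torch.cuda.amp (illustrative only).
import torch

model = torch.nn.Linear(768, 2).cuda()                      # placeholder module
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()

for _ in range(3):                                           # stand-in for the data loader
    inputs = torch.randn(256, 768, device="cuda")
    targets = torch.randint(0, 2, (256,), device="cuda")

    optimizer.zero_grad()
    with torch.cuda.amp.autocast():                          # forward pass in fp16 where safe
        loss = torch.nn.functional.cross_entropy(model(inputs), targets)
    scaler.scale(loss).backward()                            # scale loss to avoid fp16 underflow
    scaler.step(optimizer)
    scaler.update()
```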