Continual Pre-training of Language Models

Authors: Zixuan Ke, Yijia Shao, Haowei Lin, Tatsuya Konishi, Gyuhak Kim, Bing Liu

ICLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirical evaluation demonstrates the effectiveness of the proposed method. We use RoBERTa (Liu et al., 2019) as the LM. Following the standard evaluation setup (Lange et al., 2019), after a domain is trained, its training data is discarded. After all domains are incrementally learned, the final model is evaluated by fine-tuning the end-tasks in all domains.
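As a reading aid, the evaluation protocol quoted above can be sketched as a simple loop. The helper functions below (load_pretrained_roberta, load_unlabeled_corpus, dap_train_on_domain, load_end_task, fine_tune_end_task) and the domain ordering are hypothetical placeholders for illustration, not part of the released Continual_LM code:

    # Sketch of the incremental DAP-training / end-task evaluation protocol described above.
    # All function names are illustrative stand-ins, not the authors' actual API.
    domains = ["restaurant", "phone", "camera", "acl", "ai", "pubmed"]  # one possible ordering

    lm = load_pretrained_roberta()                # RoBERTa backbone LM
    for domain in domains:
        corpus = load_unlabeled_corpus(domain)    # unlabeled domain corpus for DAP-training
        lm = dap_train_on_domain(lm, corpus)      # continual domain-adaptive pre-training
        del corpus                                # training data is discarded after the domain is learned

    # After all domains are incrementally learned, evaluate the final LM
    # by fine-tuning it on the end task of every domain.
    results = {}
    for domain in domains:
        task_data = load_end_task(domain)         # labeled end-task classification dataset
        results[domain] = fine_tune_end_task(lm, task_data)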
Researcher Affiliation | Collaboration | Zixuan Ke (1), Yijia Shao (2), Haowei Lin (2), Tatsuya Konishi (3), Gyuhak Kim (1), and Bing Liu (1). (1) Department of Computer Science, University of Illinois at Chicago; (2) Wangxuan Institute of Computer Technology, Peking University; (3) KDDI Research. Emails: {zke4,gkim87,liub}@uic.edu, {shaoyj,linhaowei}@pku.edu.cn, tt-konishi@kddi-research.jp
Pseudocode | No | The paper describes the proposed technique using prose and illustrations (Figure 1) but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks.
Open Source Code | Yes | The code is available at https://github.com/UIC-Liu-Lab/Continual_LM
Open Datasets | Yes | Table 1 shows the statistics of the 6 unlabeled domain corpora for DAP-training and their 6 corresponding end-task classification datasets. 3 of them are about reviews: Yelp Restaurant (Xu et al., 2019), Amazon Phone (Ni et al., 2019), Amazon Camera (Ni et al., 2019); 3 of them are academic papers: ACL Papers (Lo et al., 2020), AI Papers (Lo et al., 2020), and PubMed Papers (https://pubmed.ncbi.nlm.nih.gov/). Their corresponding end-task classification datasets are: Restaurant (https://alt.qcri.org/semeval2014/task4/), Phone (Ding et al., 2008; Hu & Liu, 2004), Camera (Ding et al., 2008; Hu & Liu, 2004), ACL (ACL-ARC in Jurgens et al., 2018), AI (SCIERC in Luan et al., 2018), and PubMed (CHEMPROT in Kringelum et al., 2016).
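For quick reference, the corpus-to-end-task pairing described above can be summarized as a simple mapping; the dictionary and its string labels are illustrative and do not come from the paper or the released code:

    # Each unlabeled DAP-training corpus is paired with a labeled end-task classification dataset.
    domain_to_end_task = {
        "Yelp Restaurant": "Restaurant (SemEval-2014 Task 4)",
        "Amazon Phone":    "Phone (Ding et al., 2008; Hu & Liu, 2004)",
        "Amazon Camera":   "Camera (Ding et al., 2008; Hu & Liu, 2004)",
        "ACL Papers":      "ACL (ACL-ARC)",
        "AI Papers":       "AI (SCIERC)",
        "PubMed Papers":   "PubMed (CHEMPROT)",
    }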
Dataset Splits | No | We simply take the results for the last epoch, assuming no validation sets.
Hardware Specification | No | The paper does not provide any specific hardware details such as GPU or CPU models used for the experiments.
Software Dependencies | No | The paper mentions using 'RoBERTa-BASE as our backbone LM' and the 'Adam optimizer' but does not specify version numbers for any software dependencies or libraries.
Experiment Setup | Yes | DAP-training: The learning rate is set to 1e-4 and the batch size to 256. We train 2.5K steps for each domain, roughly a full pass through the domain data, following Gururangan et al. (2020) and Xu et al. (2019). The subset of data {x_n^sub} used to compute L_impt for determining head importance in Secs. 3.1 and 3.3 is set to 1.64 million tokens, which is sufficient in our experiments. λ in Eq. 8 is set to 1 and τ in Eq. 7 is set to 0.05. End-task fine-tuning: The learning rate is set to 1e-5 and the batch size to 16. We fine-tune on the end-task datasets for 5 epochs for Restaurant; 10 epochs for ACL, AI and PubMed; and 15 epochs for Phone and Camera.
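The reported hyperparameters can be collected into a configuration sketch for convenience; the dictionary structure and key names below are illustrative and do not correspond to the authors' configuration files:

    # Hyperparameters as reported in the Experiment Setup row above.
    # Key names are illustrative placeholders, not taken from the Continual_LM code.
    dap_training_config = {
        "learning_rate": 1e-4,
        "batch_size": 256,
        "steps_per_domain": 2500,               # roughly one full pass over each domain corpus
        "importance_subset_tokens": 1_640_000,  # tokens in {x_n^sub} used to compute L_impt for head importance
        "lambda_eq8": 1.0,                      # λ in Eq. 8
        "tau_eq7": 0.05,                        # τ in Eq. 7
    }

    end_task_finetuning_config = {
        "learning_rate": 1e-5,
        "batch_size": 16,
        "epochs": {
            "Restaurant": 5,
            "ACL": 10, "AI": 10, "PubMed": 10,
            "Phone": 15, "Camera": 15,
        },
    }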