How Do Large Language Models Acquire Factual Knowledge During Pretraining?

Authors: Hoyeon Chang, Jinho Park, Seonghyeon Ye, Sohee Yang, Youngkyung Seo, Du-Seong Chang, Minjoon Seo

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | This work addresses this gap by studying how LLMs acquire factual knowledge during pretraining. The findings reveal several important insights into the dynamics of factual knowledge acquisition during pretraining.
Researcher Affiliation | Collaboration | 1 KAIST {retapurayo, binlepain178, seonghyeon.ye, minjoon}@kaist.ac.kr; 2 UCL sohee.yang.22@ucl.ac.uk; 3 KT {yg.seo, dschang}@kt.com
Pseudocode | No | The paper does not contain any clearly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | Code and data are available at: https://github.com/kaist-AI/factual-knowledge-acquisition/
Open Datasets | Yes | All instances for the injected knowledge and probes are generated by prompting GPT-4 [2] using the definitions from the ECBD dataset [35] as a template... we resume pretraining OLMo [21] intermediate checkpoints ... using the pretraining data of OLMo (Dolma v1.5 [43])
Dataset Splits | No | The paper describes evaluation methods and probes but does not specify a distinct train/validation/test split (e.g., 80/10/10) for model training and hyperparameter tuning; the experiments instead track continued-pretraining dynamics with injected knowledge.
Hardware Specification | Yes | Continued pretraining of a total of 2500 steps takes approximately 3 days using 8 80GB A100 GPUs.
Software Dependencies | No | The paper mentions using OLMo [21] and Dolma v1.5 [43] but does not provide version numbers for the underlying software libraries, frameworks, or programming languages (e.g., Python or PyTorch versions).
Experiment Setup | Yes | Specifically, we resume pretraining OLMo [21] intermediate checkpoints, restoring the optimizer and scheduler states the same way OLMo is pretrained... We inject factual knowledge every 100 training steps by replacing a part of the original pretraining batch with the injected knowledge of the FICTIONAL KNOWLEDGE dataset... We compare training LLMs with a batch size reduced by a factor of 16 compared to the original pretraining batch size, i.e., from 2048 to 128. The differences in initial learning rate values for each case based on different model sizes and pretraining stages are recorded in Table 5 below.
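The Experiment Setup row describes a knowledge-injection schedule: every 100 training steps, part of the pretraining batch is replaced with sequences from the FICTIONAL KNOWLEDGE dataset, in one condition with the batch size reduced from 2048 to 128. The following is a minimal Python sketch of that scheduling logic, not the authors' implementation; maybe_inject, the toy batches, and the commented-out train_step call are illustrative placeholders.

```python
from typing import List, Sequence

INJECTION_INTERVAL = 100  # inject every 100 training steps (per the paper)
BATCH_SIZE = 128          # reduced-batch condition (original OLMo batch size is 2048)


def maybe_inject(step: int,
                 batch: List[Sequence[int]],
                 injected: List[Sequence[int]]) -> List[Sequence[int]]:
    """Splice injected-knowledge sequences into the batch on injection steps."""
    if step % INJECTION_INTERVAL != 0:
        return batch
    k = min(len(injected), len(batch))
    # Replace the first k sequences of the original pretraining batch.
    return list(injected[:k]) + list(batch[k:])


# Toy usage; real batches would come from the Dolma v1.5 stream and the
# FICTIONAL KNOWLEDGE dataset, tokenized to OLMo's sequence length.
dolma_batch = [[0] * 2048 for _ in range(BATCH_SIZE)]
fictional_docs = [[1] * 2048 for _ in range(8)]
for step in range(2500):
    batch = maybe_inject(step, dolma_batch, fictional_docs)
    # train_step(model, optimizer, scheduler, batch)  # standard causal-LM update (omitted)
```

The continued-pretraining loop itself (resuming OLMo checkpoints with restored optimizer and scheduler states) is handled by the released code; only the batch-replacement schedule is sketched here.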
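The Open Datasets row notes that the injected-knowledge and probe instances were generated by prompting GPT-4 with ECBD definitions as templates. Since the released dataset already contains these instances, the snippet below is only a hedged sketch of what such a generation call might look like with the OpenAI Python client; the prompt wording and the generate_fictional_instance helper are assumptions, not the authors' actual prompt.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def generate_fictional_instance(ecbd_definition: str) -> str:
    """Ask GPT-4 to turn an ECBD-style definition into a fictional-entity description."""
    # Hypothetical prompt for illustration only; the paper's prompt is not reproduced here.
    prompt = (
        "Using the following entity definition as a template, write a short "
        "description of a fictional entity with the same structure:\n\n"
        f"{ecbd_definition}"
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```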