How Do Large Language Models Acquire Factual Knowledge During Pretraining?

Authors: Hoyeon Chang, Jinho Park, Seonghyeon Ye, Sohee Yang, Youngkyung Seo, Du-Seong Chang, Minjoon Seo

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | This work addresses this gap by studying how LLMs acquire factual knowledge during pretraining. The findings reveal several important insights into the dynamics of factual knowledge acquisition during pretraining.
Researcher Affiliation | Collaboration | 1 KAIST {retapurayo, binlepain178, seonghyeon.ye, minjoon}@kaist.ac.kr; 2 UCL sohee.yang.22@ucl.ac.uk; 3 KT {yg.seo, dschang}@kt.com
Pseudocode | No | The paper does not contain any clearly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | Code and data are available at: https://github.com/kaist-AI/factual-knowledge-acquisition/
Open Datasets | Yes | All instances for the injected knowledge and probes are generated by prompting GPT-4 [2] using the definitions from the ECBD dataset [35] as a template... we resume pretraining OLMo [21] intermediate checkpoints ... using the pretraining data of OLMo (Dolma v1.5 [43])
Dataset Splits | No | The paper describes evaluation methods and probes but does not specify a distinct train/validation/test split (e.g., 80/10/10) for model training and hyperparameter tuning; the experiments instead track continued-pretraining dynamics with injected knowledge.
Hardware Specification | Yes | Continued pretraining of a total of 2500 steps takes approximately 3 days using 8 80GB A100 GPUs.
Software Dependencies | No | The paper mentions using OLMo [21] and Dolma v1.5 [43] but does not provide version numbers for the underlying software libraries, frameworks, or programming languages (e.g., Python or PyTorch versions).
Experiment Setup | Yes | Specifically, we resume pretraining OLMo [21] intermediate checkpoints, restoring the optimizer and scheduler states the same way OLMo is pretrained... We inject factual knowledge every 100 training steps by replacing a part of the original pretraining batch with the injected knowledge of the FICTIONAL KNOWLEDGE dataset... We compare training LLMs with a batch size reduced by a factor of 16 compared to the original pretraining batch size, i.e., from 2048 to 128. The differences in initial learning rate values for each case based on different model sizes and pretraining stages are recorded in Table 5 below.
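The Experiment Setup row describes a knowledge-injection schedule: every 100 training steps, part of the pretraining batch is replaced with sequences from the FICTIONAL KNOWLEDGE dataset, in one condition with the batch size reduced from 2048 to 128. The following is a minimal Python sketch of that scheduling logic, not the authors' implementation; maybe_inject, the toy batches, and the commented-out train_step call are illustrative placeholders.

```python
from typing import List, Sequence

INJECTION_INTERVAL = 100  # inject every 100 training steps (per the paper)
BATCH_SIZE = 128          # reduced-batch condition (original OLMo batch size is 2048)


def maybe_inject(step: int,
                 batch: List[Sequence[int]],
                 injected: List[Sequence[int]]) -> List[Sequence[int]]:
    """Splice injected-knowledge sequences into the batch on injection steps."""
    if step % INJECTION_INTERVAL != 0:
        return batch
    k = min(len(injected), len(batch))
    # Replace the first k sequences of the original pretraining batch.
    return list(injected[:k]) + list(batch[k:])


# Toy usage; real batches would come from the Dolma v1.5 stream and the
# FICTIONAL KNOWLEDGE dataset, tokenized to OLMo's sequence length.
dolma_batch = [[0] * 2048 for _ in range(BATCH_SIZE)]
fictional_docs = [[1] * 2048 for _ in range(8)]
for step in range(2500):
    batch = maybe_inject(step, dolma_batch, fictional_docs)
    # train_step(model, optimizer, scheduler, batch)  # standard causal-LM update (omitted)
```

The continued-pretraining loop itself (resuming OLMo checkpoints with restored optimizer and scheduler states) is handled by the released code; only the batch-replacement schedule is sketched here.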
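The Open Datasets row notes that the injected-knowledge and probe instances were generated by prompting GPT-4 with ECBD definitions as templates. Since the released dataset already contains these instances, the snippet below is only a hedged sketch of what such a generation call might look like with the OpenAI Python client; the prompt wording and the generate_fictional_instance helper are assumptions, not the authors' actual prompt.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def generate_fictional_instance(ecbd_definition: str) -> str:
    """Ask GPT-4 to turn an ECBD-style definition into a fictional-entity description."""
    # Hypothetical prompt for illustration only; the paper's prompt is not reproduced here.
    prompt = (
        "Using the following entity definition as a template, write a short "
        "description of a fictional entity with the same structure:\n\n"
        f"{ecbd_definition}"
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```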