Physics of Language Models: Part 3.1, Knowledge Storage and Extraction
Authors: Zeyuan Allen-Zhu, Yuanzhi Li
ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this paper, we investigate this issue using a controlled biography dataset. We find a strong correlation between the model's ability to extract knowledge and various diversity measures of the training data. ... We pretrain the language model on the biography dataset of all the 100k people. ... We then test its ability to answer QAs out of distribution about the remaining 1 − p fraction. |
| Researcher Affiliation | Collaboration | 1 Meta / FAIR Labs, USA; 2 MBZUAI, UAE. Correspondence to: Zeyuan Allen-Zhu <zeyuanallenzhu@meta.com>, Yuanzhi Li <Yuanzhi.Li@mbzuai.ac.ae>. |
| Pseudocode | No | The paper describes training procedures but does not include any explicit pseudocode or algorithm blocks. |
| Open Source Code | No | No explicit statement or link providing access to the source code for the methodology described in the paper. |
| Open Datasets | No | The paper describes synthetic datasets (bioS, bioR) that were created for the study, including their generation process, but does not provide a public link, DOI, or formal citation for accessing these specific datasets for training. |
| Dataset Splits | Yes | After pretraining the model on the entire biography dataset, we fine-tune it using question and answer (QA) pairs from a p fraction of individuals. We then test its ability to answer QAs out of distribution about the remaining 1 − p fraction. ... It is then finetuned using QAs from half of these individuals, denoted as Ptrain, without further use of biographies. The model's generalization is evaluated on questions related to the remaining half, denoted as Ptest. (A minimal sketch of this split follows the table.) |
| Hardware Specification | No | The paper mentions model sizes (e.g., 124M, 302M, 682M) but does not provide specific hardware details such as GPU models, CPU types, or memory used for experiments. |
| Software Dependencies | No | The paper mentions various model architectures and references (e.g., GPT2, LLaMA) but does not provide specific software dependencies with version numbers (e.g., PyTorch, TensorFlow, CUDA versions). |
| Experiment Setup | Yes | We use a parameter QAr to control the QA data amount, primarily setting QAr = 0.8 (a 2:8 BIO-to-QA entry ratio). The standard GPT2-small architecture comprises 12 layers with 12 heads and 768 dimensions... We retain the GPT2-small architecture (124M) for pre-training on the bioS data, but use a larger 12-layer, 20-head, 1280-dim GPT (302M) for the bioR data... We apply a low-rank update to the query/value matrices of the transformer model... and the embedding layer... For LoRA fine-tuning we consider a rank r = 2, 4, 8, 16, 32 update on the query/value (q/v) matrices and a rank r = 0, 16, 32, 64, 128 update on the word embedding matrix. (A generic LoRA sketch follows the table.) |
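
The split described under "Dataset Splits" can be summarized in a few lines. The sketch below is hypothetical (the paper does not release its data pipeline); the helper names `split_people`, `pretrain`, `finetune`, and `evaluate` are assumptions used only to illustrate that pretraining sees all biographies while QA fine-tuning sees only a p fraction of individuals.

```python
# Hypothetical sketch of the in-/out-of-distribution QA split; not the authors' code.
import random

def split_people(num_people: int = 100_000, p: float = 0.5, seed: int = 0):
    """Pretraining uses biographies of ALL people; QA fine-tuning uses only a
    p fraction (Ptrain), and generalization is measured on the rest (Ptest)."""
    people = list(range(num_people))
    random.Random(seed).shuffle(people)
    cut = int(p * num_people)
    return people[:cut], people[cut:]

p_train, p_test = split_people(p=0.5)
# pretrain(biographies(all_people))   # all 100k biography entries
# finetune(qa_pairs(p_train))         # QAs from the p fraction only
# evaluate(qa_pairs(p_test))          # out-of-distribution QA accuracy on the 1 - p fraction
```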
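
The "Experiment Setup" row mentions rank-r LoRA updates on the query/value matrices and the word-embedding matrix. The PyTorch module below is a minimal, generic LoRA sketch under the standard LoRA formulation (frozen base weight plus a trainable low-rank product); the class name, rank, and scaling are illustrative assumptions, not the authors' implementation.

```python
# Generic LoRA sketch: W*x + (alpha/r) * B(A(x)), with the base weights frozen.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wraps a frozen nn.Linear with a trainable rank-r update."""
    def __init__(self, base: nn.Linear, r: int = 16, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for param in self.base.parameters():
            param.requires_grad = False          # freeze pretrained weights
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

# Example: wrap the q/v projections of a 768-dim GPT2-small-style block with r = 16.
q_proj = LoRALinear(nn.Linear(768, 768), r=16)
v_proj = LoRALinear(nn.Linear(768, 768), r=16)
```

Applying the same wrapper to the word-embedding matrix (rank 0 meaning no update) would mirror the r = 0, 16, 32, 64, 128 embedding sweep quoted above.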