Physics of Language Models: Part 3.1, Knowledge Storage and Extraction
Authors: Zeyuan Allen-Zhu, Yuanzhi Li
ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this paper, we investigate this issue using a controlled biography dataset. We find a strong correlation between the model's ability to extract knowledge and various diversity measures of the training data. ... We pretrain the language model on the biography dataset of all the 100k people. ... We then test its ability to answer QAs out of distribution about the remaining 1 − p fraction. |
| Researcher Affiliation | Collaboration | 1 Meta / FAIR Labs, USA; 2 MBZUAI, UAE. Correspondence to: Zeyuan Allen-Zhu <zeyuanallenzhu@meta.com>, Yuanzhi Li <Yuanzhi.Li@mbzuai.ac.ae>. |
| Pseudocode | No | The paper describes training procedures but does not include any explicit pseudocode or algorithm blocks. |
| Open Source Code | No | No explicit statement or link providing access to the source code for the methodology described in the paper. |
| Open Datasets | No | The paper describes synthetic datasets (bioS, bioR) that were created for the study, including their generation process, but does not provide a public link, DOI, or formal citation for accessing these specific datasets for training. |
| Dataset Splits | Yes | After pretraining the model on the entire biography dataset, we fine-tune it using question and answer (QA) pairs from a p fraction of individuals. We then test its ability to answer QAs out of distribution about the remaining 1 − p fraction. ... It is then finetuned using QAs from half of these individuals, denoted as Ptrain, without further use of biographies. The model's generalization is evaluated on questions related to the remaining half, denoted as Ptest. (A minimal sketch of this split follows the table.) |
| Hardware Specification | No | The paper mentions model sizes (e.g., 124M, 302M, 682M) but does not provide specific hardware details such as GPU models, CPU types, or memory used for experiments. |
| Software Dependencies | No | The paper mentions various model architectures and references (e.g., GPT2, LLaMA) but does not provide specific software dependencies with version numbers (e.g., PyTorch, TensorFlow, CUDA versions). |
| Experiment Setup | Yes | We use a parameter QAr to control the QA data amount, primarily setting QAr = 0.8 (a 2:8 BIO-to-QA entry ratio). The standard GPT2-small architecture comprises 12 layers with 12 heads and 768 dimensions... We retain the GPT2-small architecture (124M) for pre-training on the bioS data, but use a larger 12-layer, 20-head, 1280-dim GPT (302M) for the bioR data... We apply a low-rank update to the query/value matrices of the transformer model... and the embedding layer... For LoRA fine-tuning we consider a rank r = 2, 4, 8, 16, 32 update on the query/value (q/v) matrices and a rank r = 0, 16, 32, 64, 128 update on the word embedding matrix. (A generic LoRA sketch follows the table.) |
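
The split described under "Dataset Splits" can be summarized in a few lines. The sketch below is hypothetical (the paper does not release its data pipeline); the helper names `split_people`, `pretrain`, `finetune`, and `evaluate` are assumptions used only to illustrate that pretraining sees all biographies while QA fine-tuning sees only a p fraction of individuals.

```python
# Hypothetical sketch of the in-/out-of-distribution QA split; not the authors' code.
import random

def split_people(num_people: int = 100_000, p: float = 0.5, seed: int = 0):
    """Pretraining uses biographies of ALL people; QA fine-tuning uses only a
    p fraction (Ptrain), and generalization is measured on the rest (Ptest)."""
    people = list(range(num_people))
    random.Random(seed).shuffle(people)
    cut = int(p * num_people)
    return people[:cut], people[cut:]

p_train, p_test = split_people(p=0.5)
# pretrain(biographies(all_people))   # all 100k biography entries
# finetune(qa_pairs(p_train))         # QAs from the p fraction only
# evaluate(qa_pairs(p_test))          # out-of-distribution QA accuracy on the 1 - p fraction
```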
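
The "Experiment Setup" row mentions rank-r LoRA updates on the query/value matrices and the word-embedding matrix. The PyTorch module below is a minimal, generic LoRA sketch under the standard LoRA formulation (frozen base weight plus a trainable low-rank product); the class name, rank, and scaling are illustrative assumptions, not the authors' implementation.

```python
# Generic LoRA sketch: W*x + (alpha/r) * B(A(x)), with the base weights frozen.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wraps a frozen nn.Linear with a trainable rank-r update."""
    def __init__(self, base: nn.Linear, r: int = 16, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for param in self.base.parameters():
            param.requires_grad = False          # freeze pretrained weights
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

# Example: wrap the q/v projections of a 768-dim GPT2-small-style block with r = 16.
q_proj = LoRALinear(nn.Linear(768, 768), r=16)
v_proj = LoRALinear(nn.Linear(768, 768), r=16)
```

Applying the same wrapper to the word-embedding matrix (rank 0 meaning no update) would mirror the r = 0, 16, 32, 64, 128 embedding sweep quoted above.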