Large Language Models Struggle to Learn Long-Tail Knowledge

Authors: Nikhil Kandpal, Haikang Deng, Adam Roberts, Eric Wallace, Colin Raffel

ICML 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this paper, we study the relationship between the knowledge memorized by large language models and the information in pre-training datasets scraped from the web. In particular, we show that a language model's ability to answer a fact-based question relates to how many documents associated with that question were seen during pre-training. We identify these relevant documents by entity linking pre-training datasets and counting documents that contain the same entities as a given question-answer pair. Our results demonstrate strong correlational and causal relationships between accuracy and relevant document count for numerous question answering datasets (e.g., TriviaQA), pre-training corpora (e.g., ROOTS), and model sizes (e.g., 176B parameters). Moreover, while larger models are better at learning long-tail knowledge, we estimate that today's models must be scaled by many orders of magnitude to reach competitive QA performance on questions with little support in the pre-training data.
Researcher Affiliation | Collaboration | 1UNC Chapel Hill, 2Google Research, 3UC Berkeley.
Pseudocode | No | The paper does not contain any pseudocode or algorithm blocks.
Open Source Code | Yes | To enable future research, we release our code as well as the entity data for ROOTS, The Pile, C4, OpenWebText, and Wikipedia at https://github.com/nkandpa2/long_tail_knowledge.
Open Datasets | Yes | To identify these entity co-occurrences we apply a highly parallelized entity linking pipeline to trillions of tokens from datasets such as C4 (Raffel et al., 2020), The Pile (Gao et al., 2020), ROOTS (Laurençon et al., 2022), OpenWebText (Gokaslan & Cohen, 2019), and Wikipedia. ... We next entity link two standard open-domain QA datasets: Natural Questions (Kwiatkowski et al., 2019) and TriviaQA (Joshi et al., 2017).
Dataset Splits | Yes | We next entity link two standard open-domain QA datasets: Natural Questions (Kwiatkowski et al., 2019) and TriviaQA (Joshi et al., 2017). To expand our sample sizes, we use both the training and validation data, except for a small set of examples used for few-shot learning prompts.
Hardware Specification | Yes | This pipeline took approximately 3 weeks to entity link 2.1TB of data on a 128-CPU-core machine.
Software Dependencies | No | The paper mentions using the "DBpedia Spotlight Entity Linker" but does not specify its version number or any other software dependencies with version information.
Experiment Setup | Yes | We focus on 4-shot evaluation, although we found that other amounts of in-context training examples produced similar trends. We use simple prompts consisting of templates of the form Q: [In-Context Question 1] A: [In-Context Answer 1] ... Q: [In-Context Question n] A: [In-Context Answer n] Q: [Test Question]. We generate answers by greedy decoding until the models generate a newline character, and we evaluate answers using the standard Exact Match (EM) metric against the ground-truth answer set (Rajpurkar et al., 2016). ... We first train a baseline 4.8 billion parameter LM on C4, following the setup from Wang et al. (2022). ... Finally, we train a counterfactual LM on this modified pre-training dataset and compare its performance to the baseline model. For both the baseline model and the counterfactual model, we train for a single epoch.
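
The relevant-document counting described in the Research Type row can be illustrated with a short sketch. This is not the authors' released pipeline: it assumes entity linking has already been run over the QA pair and over every pre-training document, and the helper name count_relevant_docs and the toy entity IDs are hypothetical. A document counts as relevant when it contains both the question entity and the answer entity.

```python
# Minimal sketch (not the authors' released code) of relevant-document counting:
# a pre-training document is "relevant" to a QA pair if its linked entities
# include both the question entity and the answer entity.

def count_relevant_docs(question_entity, answer_entity, doc_entities):
    """Count documents whose linked-entity set contains both QA entities.

    Args:
        question_entity: entity ID (e.g., a DBpedia URI) linked in the question.
        answer_entity: entity ID linked in the ground-truth answer.
        doc_entities: iterable of sets, one set of linked entity IDs per document.
    Returns:
        Number of pre-training documents that mention both entities.
    """
    return sum(
        1
        for entities in doc_entities
        if question_entity in entities and answer_entity in entities
    )


# Toy usage: three "documents" with pre-computed entity links.
docs = [
    {"dbpedia:Paris", "dbpedia:France"},
    {"dbpedia:Paris", "dbpedia:Eiffel_Tower"},
    {"dbpedia:Berlin", "dbpedia:Germany"},
]
print(count_relevant_docs("dbpedia:Paris", "dbpedia:France", docs))  # 1
```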
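The Software Dependencies row notes that entity linking uses DBpedia Spotlight. The sketch below shows a single call to the public Spotlight REST endpoint; the endpoint URL, the confidence value, and the link_entities helper are assumptions based on the public demo service, not the authors' highly parallelized pipeline or its configuration.

```python
# Illustrative per-document entity linking via the public DBpedia Spotlight
# REST endpoint. The paper ran a parallelized pipeline over trillions of
# tokens; this only shows the linking step for one text, under assumed
# endpoint and parameter settings.
import requests

def link_entities(text, confidence=0.5,
                  endpoint="https://api.dbpedia-spotlight.org/en/annotate"):
    """Return the set of DBpedia entity URIs linked in `text`."""
    resp = requests.post(
        endpoint,
        data={"text": text, "confidence": confidence},
        headers={"Accept": "application/json"},
        timeout=30,
    )
    resp.raise_for_status()
    resources = resp.json().get("Resources", [])  # absent when nothing is linked
    return {r["@URI"] for r in resources}

# Example: entities linked in one short "document".
print(link_entities("The Eiffel Tower is located in Paris, France."))
```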
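The Experiment Setup row describes 4-shot prompting with a Q:/A: template, greedy decoding until a newline, and Exact Match scoring. A minimal sketch of that prompt format and metric follows; the newline layout, the trailing "A:" cue, and the SQuAD-style answer normalization are assumptions beyond the quoted text, and the model call itself is omitted.

```python
# Minimal sketch of the few-shot prompt template and Exact Match metric.
# Normalization follows the standard SQuAD EM convention (lowercase, strip
# punctuation, drop articles, collapse whitespace).
import re
import string

def build_prompt(incontext_pairs, test_question):
    """Format k in-context QA pairs plus the test question."""
    lines = []
    for q, a in incontext_pairs:
        lines.append(f"Q: {q}")
        lines.append(f"A: {a}")
    lines.append(f"Q: {test_question}")
    lines.append("A:")  # trailing answer cue is an assumption, not quoted in the paper
    return "\n".join(lines)

def normalize(text):
    text = text.lower()
    text = "".join(ch for ch in text if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, gold_answers):
    """1.0 if the normalized prediction matches any normalized gold answer."""
    return float(any(normalize(prediction) == normalize(g) for g in gold_answers))

# Example: a 4-shot prompt and EM scoring of a hypothetical greedy completion.
prompt = build_prompt(
    [("Who wrote Hamlet?", "William Shakespeare"),
     ("What is the capital of France?", "Paris"),
     ("Who painted the Mona Lisa?", "Leonardo da Vinci"),
     ("What is the largest planet?", "Jupiter")],
    "Who developed the theory of general relativity?",
)
completion = " Albert Einstein"  # would come from greedy decoding until a newline
print(exact_match(completion, ["Albert Einstein", "Einstein"]))  # 1.0
```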