Generalization through Memorization: Nearest Neighbor Language Models

Authors: Urvashi Khandelwal, Omer Levy, Dan Jurafsky, Luke Zettlemoyer, Mike Lewis

ICLR 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct an extensive empirical evaluation. Applying our kNN augmentation to a strong WIKITEXT-103 LM using only the original dataset achieves a new state-of-the-art perplexity of 15.79, a 2.86 point improvement over the base model (Baevski & Auli, 2019), with no additional training.
Researcher Affiliation | Collaboration | Stanford University; Facebook AI Research. {urvashik,jurafsky}@stanford.edu; {omerlevy,lsz,mikelewis}@fb.com
Pseudocode | No | The paper includes 'Figure 1: An illustration of kNN-LM,' which is a diagrammatic representation, but it does not contain any structured pseudocode or algorithm blocks (a minimal sketch of the interpolation step is given after the table).
Open Source Code | Yes | Code is available at: https://github.com/urvashik/knnlm
Open Datasets | Yes | WIKITEXT-103 is a standard benchmark by Merity et al. (2017) for autoregressive language modeling with a 250K word-level vocabulary. It consists of 103M tokens of Wikipedia in the training set and 250K tokens in each of the development and test sets.
Dataset Splits | Yes | WIKITEXT-103... It consists of 103M tokens of Wikipedia in the training set and 250K tokens in each of the development and test sets.
Hardware Specification | No | The paper mentions that 'building the cache with 103M entries takes roughly two hours on a single CPU' and 'requires no GPU-based training,' but it does not specify exact CPU or GPU models, or other detailed hardware specifications.
Software Dependencies | No | The paper states, 'To search over this large datastore, we use FAISS (Johnson et al., 2017), an open source library,' and mentions 'the 29K subword vocabulary from BERT (Devlin et al., 2019),' but it does not provide specific version numbers for these or other software components such as Python or PyTorch.
Experiment Setup | Yes | This model consists of 16 layers, each with 16 self-attention heads, 1024-dimensional hidden states, and 4096-dimensional feedforward layers, amounting to 247M trainable parameters. It processes 3072 tokens of context per example for WIKITEXT-103 and 1024 tokens for the rest of the corpora. [...] During inference, we retrieve k = 1024 neighbors, and the index looks up 32 cluster centroids [...] We tune the interpolation parameter λ on the validation set. (A sketch of this retrieval and interpolation setup follows the table.)
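
Since the paper provides no pseudocode, the following is a minimal sketch of the kNN-LM interpolation described in the quotes above: the kNN distribution is a softmax over negative distances to the retrieved keys, aggregated by target token, and then linearly interpolated with the base LM distribution using λ. The helper `datastore.search` and all default values are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of kNN-LM inference, assuming a prebuilt datastore that maps
# context keys to next-token values. `datastore.search` is a hypothetical helper.
import numpy as np

def knn_lm_distribution(query, p_lm, datastore, vocab_size, k=1024, lam=0.25):
    """Interpolate the base LM distribution with a kNN distribution.

    query: hidden-state representation of the current context (the lookup key).
    p_lm:  base LM next-token distribution, shape (vocab_size,).
    lam:   interpolation weight; the paper tunes this on the validation set,
           so 0.25 is only a placeholder default here.
    """
    distances, target_ids = datastore.search(query, k)   # k nearest keys (L2 distance)

    # kNN distribution: softmax over negative distances, summed per target token.
    neg = -np.asarray(distances, dtype=np.float64)
    weights = np.exp(neg - neg.max())
    weights /= weights.sum()
    p_knn = np.zeros(vocab_size)
    np.add.at(p_knn, np.asarray(target_ids), weights)

    # Final next-token distribution is a linear interpolation of the two models.
    return lam * p_knn + (1.0 - lam) * p_lm
```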
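
The Software Dependencies and Experiment Setup rows quote the use of FAISS with k = 1024 retrieved neighbors and 32 probed cluster centroids. Below is a sketch of how such a datastore could be built and queried with the FAISS Python API; the number of centroids (4096), the 1M-key training sample, the product-quantization settings, and all function names outside of FAISS are assumptions for illustration, not the paper's released code.

```python
# Illustrative FAISS datastore for kNN-LM; values marked as assumptions below
# are not taken from the quoted text above.
import faiss
import numpy as np

def build_datastore(keys, n_centroids=4096):
    """keys: float32 array of shape (num_tokens, 1024), one 1024-dimensional
    context vector per training-set token (~103M rows for WikiText-103)."""
    dim = keys.shape[1]
    quantizer = faiss.IndexFlatL2(dim)
    # IVF index with product quantization (64 sub-quantizers x 8 bits); these
    # compression settings are assumptions, chosen only to keep the index small.
    index = faiss.IndexIVFPQ(quantizer, dim, n_centroids, 64, 8)
    sample = keys[np.random.choice(len(keys), min(1_000_000, len(keys)), replace=False)]
    index.train(sample)   # learn the cluster centroids from a sample of keys
    index.add(keys)       # the corresponding next-token values are stored separately
    return index

def query_datastore(index, query, k=1024, nprobe=32):
    """Retrieve the k nearest keys, probing 32 cluster centroids per query."""
    index.nprobe = nprobe
    distances, ids = index.search(np.asarray(query, dtype=np.float32).reshape(1, -1), k)
    return distances[0], ids[0]   # ids index into the array of stored target tokens
```

At query time, `query_datastore` supplies the distances and neighbor ids that the interpolation sketch above consumes, so the two snippets together cover the cache-then-interpolate pipeline the quoted setup describes.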