Generalization through Memorization: Nearest Neighbor Language Models
Authors: Urvashi Khandelwal, Omer Levy, Dan Jurafsky, Luke Zettlemoyer, Mike Lewis
ICLR 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct an extensive empirical evaluation. Applying our kNN augmentation to a strong WIKITEXT-103 LM using only the original dataset achieves a new state-of-the-art perplexity of 15.79, a 2.86 point improvement over the base model (Baevski & Auli, 2019), with no additional training. |
| Researcher Affiliation | Collaboration | Stanford University and Facebook AI Research; {urvashik,jurafsky}@stanford.edu, {omerlevy,lsz,mikelewis}@fb.com |
| Pseudocode | No | The paper includes 'Figure 1: An illustration of kNN-LM,' which is a diagram, but it does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code is available at: https://github.com/urvashik/knnlm |
| Open Datasets | Yes | WIKITEXT-103 is a standard benchmark by Merity et al. (2017) for autoregressive language modeling with a 250K word-level vocabulary. It consists of 103M tokens of Wikipedia in the training set and 250K tokens in each of the development and test sets. |
| Dataset Splits | Yes | WIKITEXT-103... It consists of 103M tokens of Wikipedia in the training set and 250K tokens in each of the development and test sets. |
| Hardware Specification | No | The paper mentions that 'building the cache with 103M entries takes roughly two hours on a single CPU' and 'requires no GPU-based training,' but it does not specify exact CPU or GPU models, or other detailed hardware specifications. |
| Software Dependencies | No | The paper states, 'To search over this large datastore, we use FAISS (Johnson et al., 2017), an open source library,' and mentions 'the 29K subword vocabulary from BERT (Devlin et al., 2019),' but it does not provide specific version numbers for these or other software components such as Python or PyTorch (see the retrieval sketch after the table). |
| Experiment Setup | Yes | This model consists of 16 layers, each with 16 self-attention heads, 1024 dimensional hidden states, and 4096 dimensional feedforward layers, amounting to 247M trainable parameters. It processes 3072 tokens of context per example for WIKITEXT-103 and 1024 tokens for the rest of the corpora. [...] During inference, we retrieve k = 1024 neighbors, and the index looks up 32 cluster centroids [...] We tune the interpolation parameter λ on the validation set. |
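
The 'Software Dependencies' row notes that retrieval over the datastore is done with FAISS. As a rough illustration of that step, the sketch below builds a small approximate-nearest-neighbor index and queries it with the two values quoted from the paper, 32 probed centroids and k = 1024 neighbors; the datastore contents, centroid count, and product-quantization settings are placeholders for demonstration, not details taken from the paper.

```python
import numpy as np
import faiss

# Toy stand-ins: the real datastore has ~103M (key, value) pairs, where keys are
# 1024-dimensional context representations and values are next-token ids.
d = 1024                 # key dimension (the LM's hidden-state size, from the quoted setup)
n_keys = 100_000         # demo size only; the paper's datastore is far larger
vocab_size = 250_000     # placeholder matching the quoted "250K word-level vocabulary"

keys = np.random.rand(n_keys, d).astype("float32")
values = np.random.randint(0, vocab_size, size=n_keys)

# Inverted-file index with product quantization; the centroid count and PQ
# settings below are assumptions for illustration, not numbers from the paper.
n_centroids = 1024
quantizer = faiss.IndexFlatL2(d)
index = faiss.IndexIVFPQ(quantizer, d, n_centroids, 64, 8)  # 64 sub-quantizers, 8 bits each
index.train(keys)        # learn the cluster centroids and PQ codebooks
index.add(keys)
index.nprobe = 32        # "the index looks up 32 cluster centroids" at query time

# Retrieve the k = 1024 nearest stored contexts for one test-time query.
query = np.random.rand(1, d).astype("float32")
dists, idx = index.search(query, 1024)
neighbor_tokens = values[idx[0]]   # stored next tokens of the retrieved neighbors
neighbor_dists = dists[0]          # approximate distances used to weight them
```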
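The 'Experiment Setup' row quotes the retrieval of k = 1024 neighbors and an interpolation parameter λ tuned on the validation set. The sketch below shows one plausible reading of that interpolation, p(y|x) = λ p_kNN(y|x) + (1 − λ) p_LM(y|x), where p_kNN is a softmax over negative neighbor distances; the variable names, placeholder inputs, and the λ = 0.25 default are assumptions for illustration, since the quoted text only says λ is tuned.

```python
import numpy as np

def knn_lm_interpolate(p_lm, neighbor_tokens, neighbor_dists, vocab_size, lam):
    """Combine the base LM distribution with a distribution over retrieved neighbors.

    p_lm            : (vocab_size,) next-token probabilities from the base LM
    neighbor_tokens : (k,) stored next-token ids of the k retrieved neighbors
    neighbor_dists  : (k,) distances between the query context and the stored keys
    lam             : interpolation weight, tuned on the validation set
    """
    # p_kNN(y|x): softmax over negative distances, with the mass of neighbors
    # that share the same next token aggregated together.
    logits = -neighbor_dists
    weights = np.exp(logits - logits.max())   # numerically stable softmax
    weights /= weights.sum()
    p_knn = np.zeros(vocab_size)
    np.add.at(p_knn, neighbor_tokens, weights)

    # p(y|x) = lam * p_kNN(y|x) + (1 - lam) * p_LM(y|x)
    return lam * p_knn + (1 - lam) * p_lm

# Placeholder inputs standing in for the base LM's output and the k = 1024
# retrieved neighbors; lam = 0.25 is an assumed value, not quoted from the paper.
vocab_size = 250_000
k = 1024
p_lm = np.full(vocab_size, 1.0 / vocab_size)
neighbor_tokens = np.random.randint(0, vocab_size, size=k)
neighbor_dists = np.random.rand(k)
p = knn_lm_interpolate(p_lm, neighbor_tokens, neighbor_dists, vocab_size, lam=0.25)
```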