Why do Nearest Neighbor Language Models Work?
Authors: Frank F. Xu, Uri Alon, Graham Neubig
ICML 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To this end, we perform analysis of various dimensions over which kNN-LM diverges from standard LMs, and investigate these dimensions one by one. Empirically, we identify three main reasons why kNN-LM performs better than standard LMs: using a different input representation for predicting the next tokens, approximate kNN search, and the importance of softmax temperature for the kNN distribution. Further, we incorporate some insights into the standard parametric LM, improving performance without the need for an explicit retrieval component. The code is available at https://github.com/frankxu2004/knnlm-why. (See the interpolation sketch after this table.) |
| Researcher Affiliation | Academia | Language Technologies Institute, Carnegie Mellon University, Pittsburgh, Pennsylvania, United States. Correspondence to: Frank F. Xu <fangzhex@cs.cmu.edu>, Graham Neubig <gneubig@cs.cmu.edu>. |
| Pseudocode | No | No pseudocode or algorithm blocks were found. Figure 1 is an illustration of an equation, not an algorithm. |
| Open Source Code | Yes | The code is available at https://github.com/frankxu2004/knnlm-why. |
| Open Datasets | Yes | First, we evaluate kNN-LM on Wikitext-103 (Merity et al., 2016). |
| Dataset Splits | Yes | The interpolated perplexity is computed with optimal interpolation parameter λ tuned according to the perplexity on the development set, and fixed during inference. |
| Hardware Specification | No | The paper mentions 'fit the entire matrix in the GPU' but does not provide specific hardware details such as GPU model, CPU, or memory specifications used for running experiments. |
| Software Dependencies | No | The paper mentions using the 'FAISS library' but does not provide any specific version numbers for FAISS or other software dependencies required to replicate the experiments. |
| Experiment Setup | Yes | Following Khandelwal et al. (2020b), at every retrieval step, we take the top 1024 nearest neighbors, i.e., k = 1024. The interpolated perplexity is computed with optimal interpolation parameter λ tuned according to the perplexity on the development set, and fixed during inference. We experiment with both the 5% as well as the full datastore using different temperatures ranging from 0 to 3 at 0.1 intervals. (See the sweep sketch after this table.) |
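The kNN-LM interpolation quoted in the Research Type row follows Khandelwal et al. (2020): a kNN distribution, formed by a softmax over negative distances to retrieved datastore keys, is linearly interpolated with the parametric LM's distribution. The sketch below is a minimal PyTorch illustration of that formula under stated assumptions, not the authors' code from the linked repository; the function name `knn_lm_prob` is ours, and retrieval is assumed to have already produced neighbor distances and value tokens.

```python
import torch

def knn_lm_prob(lm_logits, neighbor_dists, neighbor_tokens,
                vocab_size, lam=0.25, temperature=1.0):
    """Minimal kNN-LM interpolation (after Khandelwal et al., 2020).

    lm_logits:       (vocab_size,) logits from the parametric LM
    neighbor_dists:  (k,) distances to the k retrieved datastore keys
    neighbor_tokens: (k,) vocabulary ids of the datastore values
    """
    # Standard parametric LM distribution.
    p_lm = torch.softmax(lm_logits, dim=-1)

    # kNN distribution: softmax over negative distances. The
    # temperature here is the knob the paper identifies as important.
    weights = torch.softmax(-neighbor_dists / temperature, dim=-1)
    p_knn = torch.zeros(vocab_size)
    p_knn.scatter_add_(0, neighbor_tokens, weights)

    # Interpolate with lambda tuned on development-set perplexity.
    return lam * p_knn + (1 - lam) * p_lm

# Toy usage with k = 3 retrieved neighbors over an 8-word vocabulary.
vocab_size = 8
lm_logits = torch.randn(vocab_size)
neighbor_dists = torch.tensor([1.2, 0.8, 2.0])
neighbor_tokens = torch.tensor([3, 3, 5])
p = knn_lm_prob(lm_logits, neighbor_dists, neighbor_tokens, vocab_size)
assert torch.isclose(p.sum(), torch.tensor(1.0))
```

Lowering `temperature` sharpens the kNN distribution toward the closest neighbors, while raising it flattens the distribution, which is why the paper sweeps it explicitly.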
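The Experiment Setup row describes tuning λ on development-set perplexity and sweeping softmax temperatures from 0 to 3 at 0.1 intervals. A hypothetical grid-search sketch of that procedure follows; `dev_perplexity` is a stand-in (the paper provides no such code), and the λ grid is an assumption, since the paper only states that λ is tuned on the development set.

```python
import itertools
import numpy as np

def dev_perplexity(temperature: float, lam: float) -> float:
    """Placeholder for evaluating interpolated kNN-LM perplexity on
    the development set. The synthetic bowl below only exists to make
    the sweep runnable end to end; replace it with a real eval loop."""
    return (temperature - 1.0) ** 2 + (lam - 0.25) ** 2 + 15.0

# Temperatures at 0.1 intervals as in the paper's setup; t = 0 is
# skipped here because the kNN softmax divides by the temperature.
temperatures = np.round(np.arange(0.1, 3.0 + 1e-9, 0.1), 1)
# Hypothetical lambda grid; the paper only says lambda is tuned on dev.
lambdas = np.round(np.arange(0.0, 1.0 + 1e-9, 0.05), 2)

best_t, best_lam = min(
    itertools.product(temperatures, lambdas),
    key=lambda tl: dev_perplexity(*tl),
)
print(f"best temperature={best_t}, best lambda={best_lam}")
```

Once the best (temperature, λ) pair is found on the development set, it is held fixed during inference, matching the quoted setup.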