Test-Time Training on Nearest Neighbors for Large Language Models
Authors: Moritz Hardt, Yu Sun
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Surprisingly, retrieving and training on as few as 20 neighbors, each for only one gradient iteration, drastically improves performance across more than 20 language modeling tasks in the Pile. |
| Researcher Affiliation | Academia | Moritz Hardt, Max Planck Institute for Intelligent Systems, Tübingen, and Tübingen AI Center, University of Tübingen; Yu Sun, Stanford University |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code, index files, and model checkpoint: https://github.com/socialfoundations/tttlm |
| Open Datasets | Yes | The experiments use the Pile dataset (Gao et al., 2020): Our nearest neighbor index is based on text embeddings of the Pile training set. The entire dataset has 210M sequences and size 1.3TB. |
| Dataset Splits | Yes | The index covers only the training split: In addition, the Pile dataset has a validation set and a test set that we do not include in the index. For efficiency, we evaluate on 20% of the test set, corresponding to 42,916 sequences. |
| Hardware Specification | Yes | Figure 9 shows training cost in seconds per neighbor on a single NVIDIA A100 GPU. |
| Software Dependencies | No | The paper mentions software such as the Hugging Face library, EleutherAI's lm-evaluation-harness (Gao et al., 2021), and specific models (gpt2, gpt2-large, gpt-neo-1.3B), but does not provide version numbers for underlying dependencies such as Python, PyTorch, or the Transformers library itself. |
| Experiment Setup | Yes | Beyond these design choices, the method requires no hyper-parameter tuning. A remarkable aspect is that we can simply reuse the default hyper-parameters for the model and the optimizer available for each model in the Hugging Face library. We use a learning rate of 2e-5 for the Adam optimizer with ϵ value 1e-08. The maximum sequence length of the model is 1024 tokens. |
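
To make the retrieval step described in the Open Datasets row concrete, the sketch below builds a nearest-neighbor index over text embeddings and queries it for a test input. It is a minimal illustration under stated assumptions, not the authors' released code: the FAISS flat index and the `all-MiniLM-L6-v2` sentence encoder are stand-ins, and the paper's actual embedding model, index type, and distributed setup may differ.

```python
# Sketch of a nearest-neighbor index over text embeddings, as a stand-in for the
# Pile index described in the Open Datasets row. FAISS and the sentence-transformers
# encoder are illustrative choices, not necessarily the paper's.
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # hypothetical stand-in encoder


def build_index(train_texts):
    """Embed the training sequences and store them in a flat L2 index."""
    embeddings = encoder.encode(train_texts, convert_to_numpy=True).astype("float32")
    index = faiss.IndexFlatL2(embeddings.shape[1])
    index.add(embeddings)
    return index


def retrieve_neighbors(query_text, index, train_texts, k=20):
    """Return the k training sequences closest to the query in embedding space."""
    query = encoder.encode([query_text], convert_to_numpy=True).astype("float32")
    _, ids = index.search(query, k)
    return [train_texts[i] for i in ids[0]]
```

The `retrieve_neighbors` helper defined here is the retrieval interface assumed by the test-time training sketch that follows.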
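
The Experiment Setup row states that each test input triggers fine-tuning on its retrieved neighbors for one gradient iteration each, reusing the default Adam hyper-parameters (learning rate 2e-5, ϵ = 1e-08). The sketch below shows what such a loop could look like with Hugging Face Transformers; it is an assumption-laden illustration rather than the repository's implementation, and the `retrieve_neighbors` callable is the hypothetical helper from the previous sketch.

```python
# Sketch of test-time training on retrieved neighbors: one gradient step per
# neighbor with the hyper-parameters quoted in the Experiment Setup row.
# Not the authors' released code.
import copy
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
base_model = AutoModelForCausalLM.from_pretrained("gpt2")


def test_time_train(query_text, retrieve_neighbors, num_neighbors=20):
    """Fine-tune a fresh copy of the base model on the query's nearest
    neighbors, one gradient step per neighbor, then return the adapted copy."""
    model = copy.deepcopy(base_model)  # start from the base weights for each test input
    model.train()
    optimizer = torch.optim.Adam(model.parameters(), lr=2e-5, eps=1e-8)

    for neighbor_text in retrieve_neighbors(query_text, k=num_neighbors):
        batch = tokenizer(neighbor_text, return_tensors="pt",
                          truncation=True, max_length=1024)
        outputs = model(**batch, labels=batch["input_ids"])  # causal LM loss
        outputs.loss.backward()
        optimizer.step()       # exactly one update per retrieved neighbor
        optimizer.zero_grad()

    model.eval()
    return model
```

A perplexity evaluation on the test sequence would then run on the returned, temporarily adapted model, which is discarded before moving to the next test input.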