Test-Time Training on Nearest Neighbors for Large Language Models

Authors: Moritz Hardt, Yu Sun

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Surprisingly, retrieving and training on as few as 20 neighbors, each for only one gradient iteration, drastically improves performance across more than 20 language modeling tasks in the Pile. (Method sketched below the table.)
Researcher Affiliation | Academia | Moritz Hardt: Max Planck Institute for Intelligent Systems, Tübingen; Tübingen AI Center, University of Tübingen. Yu Sun: Stanford University.
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | Yes | Code, index files, and model checkpoint: https://github.com/socialfoundations/tttlm.
Open Datasets | Yes | Our nearest neighbor index is based on text embeddings of the Pile training set. The entire dataset has 210M sequences and size 1.3TB. In addition, the Pile dataset has a validation set and a test set that we do not include in the index. The Pile dataset (Gao et al., 2020). For efficiency, we evaluate on 20% of the test set, corresponding to 42,916 sequences.
Dataset Splits | Yes | Our nearest neighbor index is based on text embeddings of the Pile training set. The entire dataset has 210M sequences and size 1.3TB. In addition, the Pile dataset has a validation set and a test set that we do not include in the index. The Pile dataset (Gao et al., 2020). For efficiency, we evaluate on 20% of the test set, corresponding to 42,916 sequences.
Hardware Specification | Yes | Figure 9 shows training cost in seconds per neighbor on a single NVIDIA A100 GPU.
Software Dependencies | No | The paper mentions software such as the Hugging Face library, EleutherAI's lm-evaluation-harness library (Gao et al., 2021), and specific models (gpt2, gpt2-large, gpt-neo-1.3B), but does not provide version numbers for the underlying dependencies, e.g. PyTorch, Python, or the Transformers library itself.
Experiment Setup | Yes | Beyond these design choices, the method requires no hyper-parameter tuning. A remarkable aspect is that we can simply reuse the default hyper-parameters for the model and the optimizer available for each model in the Hugging Face library. We use a learning rate of 2e-5 for the Adam optimizer with ϵ value 1e-08. The maximum sequence length of the model is 1024 tokens.
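
The Research Type row summarizes the procedure under study: for each test input, retrieve its nearest neighbors from the Pile index and fine-tune the language model on their text, one gradient iteration per neighbor, before scoring the input. The sketch below is a minimal illustration of that loop using the hyper-parameters quoted in the Experiment Setup row; it is not the authors' implementation (see the repository linked in the Open Source Code row), and the index.search retrieval helper and its return format are assumptions made for illustration.

import copy
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def evaluate_with_ttt(base_model, tokenizer, index, test_text,
                      num_neighbors=20, lr=2e-5, eps=1e-8, max_len=1024):
    """Fine-tune a fresh copy of the model on the test input's retrieved
    neighbors, one gradient step per neighbor, then score the test input."""
    model = copy.deepcopy(base_model)   # start from the original weights for every test input
    model.train()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr, eps=eps)

    # index.search is a hypothetical retrieval helper returning neighbor texts.
    for neighbor_text in index.search(test_text, k=num_neighbors):
        batch = tokenizer(neighbor_text, return_tensors="pt",
                          truncation=True, max_length=max_len)
        loss = model(**batch, labels=batch["input_ids"]).loss
        loss.backward()
        optimizer.step()                # a single gradient iteration per neighbor
        optimizer.zero_grad()

    # Score the test sequence with the adapted model.
    model.eval()
    with torch.no_grad():
        batch = tokenizer(test_text, return_tensors="pt",
                          truncation=True, max_length=max_len)
        return model(**batch, labels=batch["input_ids"]).loss.item()

# Example usage with one of the models named in the table above:
# model = AutoModelForCausalLM.from_pretrained("gpt2")
# tokenizer = AutoTokenizer.from_pretrained("gpt2")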
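
For the Experiment Setup row, the quoted choices amount to stock Hugging Face defaults plus a fixed Adam configuration. A minimal sketch of that configuration, assuming the standard transformers and torch APIs; the model names follow the Software Dependencies row:

import torch
from transformers import AutoConfig, AutoModelForCausalLM

model_name = "gpt2"   # the paper also evaluates gpt2-large and EleutherAI/gpt-neo-1.3B
config = AutoConfig.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Adam with the quoted settings; everything else stays at the Hugging Face defaults.
optimizer = torch.optim.Adam(model.parameters(), lr=2e-5, eps=1e-8)

# Maximum sequence length comes from the model's own config (1024 for GPT-2).
max_length = getattr(config, "n_positions", None) or config.max_position_embeddings

Because nothing here is tuned per task, the same construction can be reused unchanged across the models and Pile subsets discussed in the table.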