Test-Time Training on Nearest Neighbors for Large Language Models

Authors: Moritz Hardt, Yu Sun

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Surprisingly, retrieving and training on as few as 20 neighbors, each for only one gradient iteration, drastically improves performance across more than 20 language modeling tasks in the Pile. (Method sketched below the table.)
Researcher Affiliation | Academia | Moritz Hardt: Max Planck Institute for Intelligent Systems, Tübingen; Tübingen AI Center, University of Tübingen. Yu Sun: Stanford University.
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | Yes | Code, index files, and model checkpoint: https://github.com/socialfoundations/tttlm.
Open Datasets | Yes | Our nearest neighbor index is based on text embeddings of the Pile training set. The entire dataset has 210M sequences and size 1.3TB. In addition, the Pile dataset has a validation set and a test set that we do not include in the index. The Pile dataset (Gao et al., 2020). For efficiency, we evaluate on 20% of the test set, corresponding to 42,916 sequences.
Dataset Splits | Yes | Our nearest neighbor index is based on text embeddings of the Pile training set. The entire dataset has 210M sequences and size 1.3TB. In addition, the Pile dataset has a validation set and a test set that we do not include in the index. The Pile dataset (Gao et al., 2020). For efficiency, we evaluate on 20% of the test set, corresponding to 42,916 sequences.
Hardware Specification | Yes | Figure 9 shows training cost in seconds per neighbor on a single NVIDIA A100 GPU.
Software Dependencies | No | The paper mentions software such as the Hugging Face library, EleutherAI's lm-evaluation-harness library (Gao et al., 2021), and specific models (gpt2, gpt2-large, gpt-neo-1.3B), but does not provide version numbers for the underlying dependencies, e.g. PyTorch, Python, or the Transformers library itself.
Experiment Setup | Yes | Beyond these design choices, the method requires no hyper-parameter tuning. A remarkable aspect is that we can simply reuse the default hyper-parameters for the model and the optimizer available for each model in the Hugging Face library. We use a learning rate of 2e-5 for the Adam optimizer with ϵ value 1e-08. The maximum sequence length of the model is 1024 tokens.
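
The Research Type row summarizes the procedure under study: for each test input, retrieve its nearest neighbors from the Pile index and fine-tune the language model on their text, one gradient iteration per neighbor, before scoring the input. The sketch below is a minimal illustration of that loop using the hyper-parameters quoted in the Experiment Setup row; it is not the authors' implementation (see the repository linked in the Open Source Code row), and the index.search retrieval helper and its return format are assumptions made for illustration.

import copy
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def evaluate_with_ttt(base_model, tokenizer, index, test_text,
                      num_neighbors=20, lr=2e-5, eps=1e-8, max_len=1024):
    """Fine-tune a fresh copy of the model on the test input's retrieved
    neighbors, one gradient step per neighbor, then score the test input."""
    model = copy.deepcopy(base_model)   # start from the original weights for every test input
    model.train()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr, eps=eps)

    # index.search is a hypothetical retrieval helper returning neighbor texts.
    for neighbor_text in index.search(test_text, k=num_neighbors):
        batch = tokenizer(neighbor_text, return_tensors="pt",
                          truncation=True, max_length=max_len)
        loss = model(**batch, labels=batch["input_ids"]).loss
        loss.backward()
        optimizer.step()                # a single gradient iteration per neighbor
        optimizer.zero_grad()

    # Score the test sequence with the adapted model.
    model.eval()
    with torch.no_grad():
        batch = tokenizer(test_text, return_tensors="pt",
                          truncation=True, max_length=max_len)
        return model(**batch, labels=batch["input_ids"]).loss.item()

# Example usage with one of the models named in the table above:
# model = AutoModelForCausalLM.from_pretrained("gpt2")
# tokenizer = AutoTokenizer.from_pretrained("gpt2")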
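
For the Experiment Setup row, the quoted choices amount to stock Hugging Face defaults plus a fixed Adam configuration. A minimal sketch of that configuration, assuming the standard transformers and torch APIs; the model names follow the Software Dependencies row:

import torch
from transformers import AutoConfig, AutoModelForCausalLM

model_name = "gpt2"   # the paper also evaluates gpt2-large and EleutherAI/gpt-neo-1.3B
config = AutoConfig.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Adam with the quoted settings; everything else stays at the Hugging Face defaults.
optimizer = torch.optim.Adam(model.parameters(), lr=2e-5, eps=1e-8)

# Maximum sequence length comes from the model's own config (1024 for GPT-2).
max_length = getattr(config, "n_positions", None) or config.max_position_embeddings

Because nothing here is tuned per task, the same construction can be reused unchanged across the models and Pile subsets discussed in the table.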