One-Layer Transformer Provably Learns One-Nearest Neighbor In Context

Authors: Zihao Li, Yuan Cao, Cheng Gao, Yihan He, Han Liu, Jason Klusowski, Jianqing Fan, Mengdi Wang

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
--- | --- | ---
Research Type | Experimental | 'Our result gives a concrete example of how transformers can be trained to implement nonparametric machine learning algorithms, and sheds light on the role of softmax attention in transformer models. We further justify our results with numerical simulations.' (Section 5, Numerical Results; the role of softmax attention is illustrated in the first sketch after the table.)
Researcher Affiliation | Academia | '1 Princeton University, 2 The University of Hong Kong, 3 Northwestern University'
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | No | The paper does not contain a concrete statement about open-sourcing the code for its methodology, nor does it provide a direct link to a code repository.
Open Datasets | No | The paper describes a synthetic data-generation process in Assumption 1 and Appendix B rather than using or providing access to a publicly available dataset. For instance, Appendix B states: 'In our experiment, we generate the testing dataset ... by the following procedure: (i) We sample x_i from the uniform distribution on S^{d-1} and y_i ~ N(0, 1) for all i ∈ [N]; (ii) We random sample i* ∈ [N] by uniform distribution and set x_{N+1} = x_{i*} with the 1-NN label being y_{i*}; (iii) If ‖x_j − x_{N+1}‖_2^2 ≤ δ, set x_j = −x_j.' (A code sketch of this procedure appears after the table.)
Dataset Splits | No | The paper states, 'The model is trained on a dataset with a size of 10000, and an epoch number of 2000,' and 'We test the trained transformer model on this dataset once every epoch throughout the training process.' While it discusses training and testing data, it does not explicitly mention a separate validation set or specific train/validation/test splits.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., CPU/GPU models, memory, or cloud resources) used for running the experiments.
Software Dependencies | No | The paper mentions using SGD as an optimizer but does not specify any software dependencies (e.g., programming languages, libraries, or frameworks) with version numbers.
Experiment Setup | Yes | 'We choose context length N ∈ {16, 32, 64} and input dimension d ∈ {8, 16}. The model is trained on a dataset with a size of 10000, and an epoch number of 2000. To ensure our training convergence result holds beyond the gradient descent scheme, we choose SGD as our optimizer, with a batch size of 128 and a learning rate of 0.1.' (A training-loop sketch using these hyperparameters appears after the table.)
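
The 'Research Type' row quotes the paper's claim that a one-layer transformer can be trained to implement one-nearest-neighbor (1-NN) prediction through softmax attention. As a minimal illustration of why softmax attention can behave like a 1-NN rule (a sketch under simplifying assumptions, not the paper's construction or its trained parameters), the snippet below uses a single scalar scale `w`: for unit-norm inputs, a larger inner product with the query means a smaller Euclidean distance, so as `w` grows the softmax weights concentrate on the nearest context point.

```python
import numpy as np

def softmax_attention_prediction(X, y, x_query, w):
    """Softmax-attention readout: a weighted average of the context labels y
    with logits w * <x_i, x_query>.  Since all inputs are unit vectors,
    ranking by inner product equals ranking by (negative) Euclidean distance,
    so as w grows the weights concentrate on the 1-nearest neighbor.
    The scalar `w` is an illustrative stand-in for the trained attention
    parameters analyzed in the paper, not a claim about their exact form."""
    scores = w * (X @ x_query)                  # attention logits, shape (N,)
    weights = np.exp(scores - scores.max())     # numerically stable softmax
    weights /= weights.sum()
    return float(weights @ y)                   # predicted label for the query

rng = np.random.default_rng(0)
N, d = 32, 8
X = rng.standard_normal((N, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)   # context points on S^{d-1}
y = rng.standard_normal(N)                      # context labels
x_query = X[5] + 0.05 * rng.standard_normal(d)  # query close to x_5, so its
x_query /= np.linalg.norm(x_query)              # 1-NN is (with high prob.) x_5

for w in (1.0, 10.0, 100.0):
    pred = softmax_attention_prediction(X, y, x_query, w)
    print(f"w={w:6.1f}  prediction={pred:+.4f}  1-NN label={y[5]:+.4f}")
```

As `w` increases the prediction moves from a near-uniform average of the labels toward the 1-NN label, which matches the qualitative behavior the paper establishes for the trained one-layer model.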
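
The 'Open Datasets' row quotes the Appendix B procedure for building the synthetic test prompts. Below is a minimal sketch of that procedure; the function and variable names are mine, and step (iii) is implemented as a sign flip of any other context point within δ of the query, which is one reading of the quoted rule and may differ in detail from the paper.

```python
import numpy as np

def generate_test_prompt(N, d, delta, rng):
    """Sketch of the Appendix B test-prompt construction quoted above:
    (i)   x_i uniform on the unit sphere S^{d-1}, y_i ~ N(0, 1);
    (ii)  pick i* uniformly and set the query x_{N+1} = x_{i*}, so the
          1-NN label of the query is y_{i*};
    (iii) move any other context point within delta of the query away
          from it (done here by a sign flip -- an assumption)."""
    X = rng.standard_normal((N, d))
    X /= np.linalg.norm(X, axis=1, keepdims=True)   # (i) uniform on S^{d-1}
    y = rng.standard_normal(N)                      # (i) y_i ~ N(0, 1)

    i_star = int(rng.integers(N))                   # (ii) uniform index i*
    x_query = X[i_star].copy()                      # (ii) x_{N+1} = x_{i*}
    y_query = float(y[i_star])                      # (ii) its 1-NN label

    for j in range(N):                              # (iii) enforce separation
        if j != i_star and np.sum((X[j] - x_query) ** 2) <= delta:
            X[j] = -X[j]

    return X, y, x_query, y_query

rng = np.random.default_rng(0)
X, y, x_query, y_query = generate_test_prompt(N=16, d=8, delta=0.5, rng=rng)
```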
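
The 'Experiment Setup' row gives the hyperparameters used in the paper's simulations: N ∈ {16, 32, 64}, d ∈ {8, 16}, 10000 training prompts, 2000 epochs, and SGD with batch size 128 and learning rate 0.1. The PyTorch sketch below wires those numbers into a training loop for an assumed one-parameter softmax-attention model; the paper's actual parameterization (a one-layer transformer with full attention matrices), training distribution (Assumption 1), and loss may differ from these simplifications.

```python
import torch

class ScalarSoftmaxAttention(torch.nn.Module):
    """Assumed one-parameter softmax-attention model: the prediction is a
    softmax-weighted average of the context labels with logits w * <x_i, x_q>.
    This scalar version only mirrors the quoted optimization settings, not
    the paper's full one-layer transformer."""
    def __init__(self):
        super().__init__()
        self.w = torch.nn.Parameter(torch.tensor(1.0))

    def forward(self, X, y, x_query):
        # X: (B, N, d), y: (B, N), x_query: (B, d)
        scores = self.w * torch.einsum("bnd,bd->bn", X, x_query)
        weights = torch.softmax(scores, dim=-1)
        return (weights * y).sum(dim=-1)            # (B,) predicted labels

# Hyperparameters quoted in the table.
num_prompts, num_epochs, batch_size, lr = 10000, 2000, 128, 0.1
N, d = 16, 8                                        # one of the (N, d) settings

# Training prompts drawn i.i.d. (a simplification; see Assumption 1 in the paper):
# the query repeats a random context point, whose label is the 1-NN target.
X = torch.randn(num_prompts, N, d)
X = X / X.norm(dim=-1, keepdim=True)                # context points on the sphere
y = torch.randn(num_prompts, N)
idx = torch.randint(0, N, (num_prompts,))
x_query = X[torch.arange(num_prompts), idx]
y_target = y[torch.arange(num_prompts), idx]

model = ScalarSoftmaxAttention()
opt = torch.optim.SGD(model.parameters(), lr=lr)    # SGD, learning rate 0.1

for epoch in range(num_epochs):                     # 2000 epochs
    perm = torch.randperm(num_prompts)
    for start in range(0, num_prompts, batch_size): # batch size 128
        b = perm[start:start + batch_size]
        pred = model(X[b], y[b], x_query[b])
        loss = ((pred - y_target[b]) ** 2).mean()   # squared loss (assumed)
        opt.zero_grad()
        loss.backward()
        opt.step()
```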