The Closeness of In-Context Learning and Weight Shifting for Softmax Regression

Authors: Shuai Li, Zhao Song, Yu Xia, Tong Yu, Tianyi Zhou

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this section, we present our numerical experiments to validate our theoretical result that, when training self-attention-only Transformers for softmax regression tasks, the model learned by gradient descent and the trained Transformer show great similarity.
Researcher Affiliation | Collaboration | Shuai Li (Shanghai Jiao Tong University, shuaili8@sjtu.edu.cn); Zhao Song (Simons Institute for the Theory of Computing, UC Berkeley, magic.linuxkde@gmail.com); Yu Xia (University of California, San Diego, yux078@ucsd.edu); Tong Yu (Adobe Research, tyu@adobe.com); Tianyi Zhou (University of Southern California, tzhou029@usc.edu)
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | No | The data and code are planned to be released upon acceptance and approval.
Open Datasets | No | According to Definition 1.3, we construct synthetic softmax regression tasks, each consisting of a randomly sampled length-n document A ∈ ℝ^{n×d}, in which each word has a d-dimensional embedding, and targets b ∈ ℝ^n. Each document is generated from a unique random seed. The paper does not provide concrete access information (link, DOI, formal citation) for a publicly available or open dataset. (See the data-generation sketch after this table.)
Dataset Splits | No | To compare the trained single self-attention layer with a softmax unit against the softmax regression model trained with one-step gradient descent, we sample 10^3 tasks and record the losses of the two models. While a 'training set' of tasks is mentioned for learning-rate selection, explicit train/validation/test splits of a dataset are not described in the usual sense required for reproducibility. (See the loss-recording sketch after this table.)
Hardware Specification | Yes | All experiments run on a single NVIDIA RTX 2080 Ti GPU with 10 independent repetitions.
Software Dependencies | No | The paper does not specify any software versions or library dependencies required for replication.
Experiment Setup | Yes | For the single self-attention layer with a softmax unit, we choose the learning rate ηSA = 0.005. For the softmax regression model, we determine the optimal learning rate ηGD by minimizing the ℓ2 regression loss over a training set of 10^3 tasks through line search. (See the line-search sketch after this table.)
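
The generating distribution behind the synthetic tasks is fixed by the paper's Definition 1.3, which is not reproduced here. The following is therefore a minimal data-generation sketch, assuming i.i.d. standard Gaussian embeddings and a target given by the softmax of the document applied to a random direction; make_task, x_star, and the sizes n=20, d=16 are illustrative names and values, not the paper's.

```python
import numpy as np

def make_task(seed, n=20, d=16):
    """Generate one synthetic softmax regression task from its own seed.

    Assumption: entries of A and the hidden parameter x_star are i.i.d.
    standard Gaussian, and the target b is the softmax of A @ x_star.
    The paper's Definition 1.3 fixes the actual distributions.
    """
    rng = np.random.default_rng(seed)    # unique random seed per document
    A = rng.standard_normal((n, d))      # length-n document, d-dim word embeddings
    x_star = rng.standard_normal(d)      # hypothetical ground-truth parameter
    z = A @ x_star
    b = np.exp(z - z.max())
    b /= b.sum()                         # softmax target b in R^n
    return A, b

# 10^3 tasks, one unique seed each
tasks = [make_task(seed) for seed in range(1000)]
```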
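For the learning-rate line search, here is a sketch under the assumption that the per-task objective is the ℓ2 loss L(x) = ||softmax(Ax) − b||² minimized from a zero initialization with a single gradient step; the helper one_step_gd_loss, the candidate grid, and the reuse of tasks from the sketch above are illustrative choices, not the paper's exact protocol.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def one_step_gd_loss(eta, train_tasks):
    """Average l2 loss after one gradient-descent step with rate eta.

    Assumed objective per task: L(x) = ||softmax(A @ x) - b||_2^2,
    started from x = 0; the paper's exact objective follows its setup.
    """
    total = 0.0
    for A, b in train_tasks:
        _, d = A.shape
        x = np.zeros(d)
        p = softmax(A @ x)
        J = np.diag(p) - np.outer(p, p)   # softmax Jacobian (n x n)
        grad = 2.0 * A.T @ (J @ (p - b))  # chain rule through softmax(A x)
        x = x - eta * grad                # the single gradient-descent step
        total += np.sum((softmax(A @ x) - b) ** 2)
    return total / len(train_tasks)

# crude grid-based line search for the gradient-descent learning rate;
# `tasks` is the training set of 10^3 tasks from the sketch above
candidates = np.logspace(-4, 0, 25)
eta_gd = min(candidates, key=lambda eta: one_step_gd_loss(eta, tasks))
eta_sa = 0.005                            # fixed rate for the attention layer
```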
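Finally, a sketch of how the losses of the two models could be recorded on a fresh sample of 10^3 tasks; model_sa and model_gd are placeholder callables standing in for the trained single self-attention layer with a softmax unit and the one-step gradient-descent regressor, whose constructions follow the paper and are not reproduced here.

```python
import numpy as np

def record_losses(model_sa, model_gd, eval_tasks):
    """Record per-task l2 losses of both models on the same tasks.

    model_sa / model_gd are assumed to be callables mapping a document A
    (shape n x d) to a prediction in R^n; they are placeholders for the
    trained attention layer and the one-step GD regressor.
    """
    losses_sa, losses_gd = [], []
    for A, b in eval_tasks:
        losses_sa.append(float(np.sum((model_sa(A) - b) ** 2)))
        losses_gd.append(float(np.sum((model_gd(A) - b) ** 2)))
    return np.array(losses_sa), np.array(losses_gd)

# e.g. eval_tasks = [make_task(seed) for seed in range(1000, 2000)]
```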