The Closeness of In-Context Learning and Weight Shifting for Softmax Regression
Authors: Shuai Li, Zhao Song, Yu Xia, Tong Yu, Tianyi Zhou
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we present our numerical experiments to validate our theoretical results that, when training self-attention-only Transformers for softmax regression tasks, the models learned by gradient descent and by Transformers show great similarity. |
| Researcher Affiliation | Collaboration | Shuai Li, Shanghai Jiao Tong University (shuaili8@sjtu.edu.cn); Zhao Song, Simons Institute for the Theory of Computing, UC Berkeley (magic.linuxkde@gmail.com); Yu Xia, University of California, San Diego (yux078@ucsd.edu); Tong Yu, Adobe Research (tyu@adobe.com); Tianyi Zhou, University of Southern California (tzhou029@usc.edu) |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | No | The data and code are planned to be released upon acceptance and approval. |
| Open Datasets | No | According to Definition 1.3, the synthetic softmax regression tasks are constructed from randomly sampled length-n documents A ∈ ℝ^{n×d}, where each word has a d-dimensional embedding, together with targets b ∈ ℝ^n. Each document is generated from a unique random seed. The paper does not provide concrete access information (link, DOI, formal citation) for a publicly available or open dataset. (A hedged sketch of this task construction appears after the table.) |
| Dataset Splits | No | To compare the trained single self-attention layer with a softmax unit and the softmax regression model trained with one-step gradient descent, we sample 10^3 tasks and record the losses of the two models. While a 'training set' of tasks is mentioned for the learning-rate search, explicit train/validation/test splits of a dataset are not described. (A hedged sketch of this evaluation bookkeeping appears after the table.) |
| Hardware Specification | Yes | All experiments run on a single NVIDIA RTX2080Ti GPU with 10 independent repetitions. |
| Software Dependencies | No | The paper does not specify any software versions or library dependencies required for replication. |
| Experiment Setup | Yes | For the single self-attention layer with a softmax unit, we choose the learning rate η_SA = 0.005. For the softmax regression model, we determine the optimal learning rate η_GD by minimizing the ℓ2 regression loss over a training set of 10^3 tasks through line search. (A hedged sketch of this line search appears after the table.) |
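
The task construction described in the Open Datasets row can be illustrated with a short NumPy sketch. The hidden ground-truth vector `x_star`, the softmax link used to produce the targets, the dimensions `n` and `d`, and the helper name `make_task` are assumptions for illustration only; the paper's Definition 1.3 fixes the exact recipe.

```python
import numpy as np

def make_task(seed: int, n: int = 20, d: int = 16):
    """Build one synthetic softmax regression task (hypothetical construction).

    Returns a length-n document A in R^{n x d} (one d-dimensional word
    embedding per row) and a target vector b in R^n. The hidden vector
    x_star and the softmax link are assumptions, not the paper's recipe.
    """
    rng = np.random.default_rng(seed)           # unique random seed per task
    A = rng.standard_normal((n, d))             # random word embeddings
    x_star = rng.standard_normal(d)             # hypothetical ground-truth vector
    logits = A @ x_star
    b = np.exp(logits - logits.max())           # stabilised exponentials
    b /= b.sum()                                # softmax-normalised target, b in R^n
    return A, b

# Sample 10^3 tasks, one seed per document.
tasks = [make_task(seed) for seed in range(1000)]
```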
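
The learning-rate line search in the Experiment Setup row can be sketched as follows. The zero initialisation, the objective 0.5·||softmax(Ax) − b||², and the candidate grid are assumptions; only the idea of choosing η_GD by minimizing the ℓ2 regression loss over a training set of 10^3 tasks via line search comes from the paper.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())                     # shift for numerical stability
    return e / e.sum()

def one_step_gd_loss(A, b, eta, x0=None):
    """ℓ2 regression loss after a single gradient-descent step (sketch).

    Assumed objective: L(x) = 0.5 * ||softmax(Ax) - b||^2, started from x = 0.
    """
    n, d = A.shape
    x = np.zeros(d) if x0 is None else x0
    u = softmax(A @ x)
    # Gradient via the softmax Jacobian diag(u) - u u^T and the chain rule.
    grad = A.T @ ((np.diag(u) - np.outer(u, u)) @ (u - b))
    x = x - eta * grad                          # one gradient-descent step
    return 0.5 * np.sum((softmax(A @ x) - b) ** 2)

def line_search_eta(train_tasks, candidates=np.logspace(-3, 1, 50)):
    """Pick eta_GD that minimises the mean loss over the training tasks."""
    mean_loss = [np.mean([one_step_gd_loss(A, b, eta) for A, b in train_tasks])
                 for eta in candidates]
    return float(candidates[int(np.argmin(mean_loss))])

# eta_gd = line_search_eta(tasks)               # tasks from the sketch above
```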
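
Finally, the comparison described in the Dataset Splits row amounts to recording both models' losses on 10^3 sampled tasks. The sketch below reuses `make_task` and `one_step_gd_loss` from the sketches above and treats the trained single self-attention layer with a softmax unit as an opaque `attention_predict` callable; that name and the seed offset are placeholders, not the paper's code.

```python
import numpy as np

def evaluate(attention_predict, eta_gd, n_tasks=1000, seed_offset=100_000):
    """Mean ℓ2 losses of both models over 10^3 sampled tasks (sketch)."""
    sa_losses, gd_losses = [], []
    for seed in range(seed_offset, seed_offset + n_tasks):
        A, b = make_task(seed)                  # fresh task, unique seed
        sa_losses.append(0.5 * np.sum((attention_predict(A) - b) ** 2))
        gd_losses.append(one_step_gd_loss(A, b, eta_gd))
    return np.mean(sa_losses), np.mean(gd_losses)
```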