In-Context Learning with Transformers: Softmax Attention Adapts to Function Lipschitzness
Authors: Liam Collins, Advait Parulekar, Aryan Mokhtari, Sujay Sanghavi, Sanjay Shakkottai
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We verify all of these results with empirical simulations (Section 3.2 and Appendix J). |
| Researcher Affiliation | Academia | Liam Collins, Chandra Family Department of ECE, The University of Texas at Austin, liamc@utexas.edu |
| Pseudocode | No | The paper does not contain any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [Yes] Justification: Please see supplementary material. |
| Open Datasets | No | The paper generates data from specified distributions rather than using a pre-existing publicly available dataset: 'f ∼ D(F), x_1, ..., x_{n+1} i.i.d. ∼ D_x, ϵ_1, ..., ϵ_n i.i.d. ∼ D_ϵ' (see the data-generation sketch below the table). |
| Dataset Splits | No | The paper describes a pretraining protocol and evaluation on new tasks, but it does not specify explicit training/validation/test dataset splits with percentages or sample counts. |
| Hardware Specification | No | All experiments were run in Google Colab on a CPU runtime. |
| Software Dependencies | No | All training was executed in PyTorch with the Adam optimizer. |
| Experiment Setup | Yes | In all cases we use the Adam optimizer with one task sampled per round, use the noise distribution D_ϵ = N(0, σ²), and run 10 trials and plot means and standard deviations over these 10 trials. We use an exponentially decaying learning rate schedule with factor 0.999. In Figures 3 and 5 we use an initial learning rate of 0.1, and in Figure 4 we use an initial learning rate of 0.01. (See the training-setup sketch below the table.) |
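
The Open Datasets row quotes the paper's prompt-sampling notation. Below is a minimal PyTorch sketch of that sampling process; the linear function class, the input dimension `d`, the prompt length `n`, and the noise level `sigma` are illustrative assumptions (the paper varies the Lipschitzness of the function class), so treat this as a placeholder for D(F), D_x, and D_ϵ rather than the paper's exact generator.

```python
import torch

def sample_icl_prompt(n: int = 20, d: int = 5, sigma: float = 0.1):
    """Sample one in-context learning prompt (x_1, y_1, ..., x_n, y_n, x_{n+1}).

    Hypothetical instantiation: f is a random linear map and x_i ~ N(0, I_d);
    both are stand-ins for the paper's D(F) and D_x.
    """
    w = torch.randn(d)              # f ~ D(F): placeholder linear function (assumption)
    x = torch.randn(n + 1, d)       # x_1, ..., x_{n+1} i.i.d. ~ D_x (placeholder Gaussian)
    eps = sigma * torch.randn(n)    # ϵ_1, ..., ϵ_n i.i.d. ~ D_ϵ = N(0, σ²)
    y = x[:n] @ w + eps             # noisy labels for the n in-context examples
    target = x[n] @ w               # clean label of the query point x_{n+1}
    return x, y, target
```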
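
The Experiment Setup row states the optimizer and learning-rate schedule. The sketch below wires up that configuration, reusing `sample_icl_prompt` from the sketch above; the model, the loss, and the number of training rounds are placeholders, while the Adam optimizer, the exponential decay factor 0.999, and the initial learning rates 0.1 / 0.01 come from the quoted setup.

```python
import torch

d, n, sigma = 5, 20, 0.1

# Placeholder model: the paper trains softmax/linear attention units; a linear
# probe stands in here only so the snippet runs end to end.
model = torch.nn.Linear(d, 1)

# Optimizer and schedule as quoted: Adam with an exponentially decaying
# learning rate (factor 0.999); initial lr 0.1 (0.01 for the Figure-4 setting).
optimizer = torch.optim.Adam(model.parameters(), lr=0.1)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.999)

for round_idx in range(1000):                 # number of rounds is an assumption
    x, y, target = sample_icl_prompt(n=n, d=d, sigma=sigma)  # one task sampled per round
    pred = model(x[-1]).squeeze()             # placeholder prediction for the query x_{n+1}
    loss = (pred - target) ** 2               # squared prediction error
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()                          # apply the 0.999 decay each round
```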