In-Context Learning with Transformers: Softmax Attention Adapts to Function Lipschitzness

Authors: Liam Collins, Advait Parulekar, Aryan Mokhtari, Sujay Sanghavi, Sanjay Shakkottai

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We verify all of these results with empirical simulations (Section 3.2 and Appendix J).
Researcher Affiliation | Academia | Liam Collins, Chandra Family Department of ECE, The University of Texas at Austin, liamc@utexas.edu
Pseudocode | No | The paper does not contain any clearly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | The NeurIPS checklist asks whether the paper provides open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results; the authors answer [Yes] with the justification "Please see supplementary material."
Open Datasets | No | The paper generates data from specified distributions rather than using a pre-existing, publicly available dataset: 'f ∼ D(F), x1, …, xn+1 i.i.d. ∼ Dx, ϵ1, …, ϵn i.i.d. ∼ Dϵ' (see the data-generation sketch below).
Dataset Splits | No | The paper describes a pretraining protocol and evaluation on new tasks, but it does not specify explicit training/validation/test splits with percentages or sample counts.
Hardware Specification | No | All experiments were run in Google Colab in a CPU runtime.
Software Dependencies | No | All training was executed in PyTorch with the Adam optimizer.
Experiment Setup | Yes | In all cases we use the Adam optimizer with one task sampled per round, use the noise distribution Dϵ = N(0, σ²), and run 10 trials and plot means and standard deviations over these 10 trials. We use an exponentially decaying learning rate schedule with factor 0.999. In Figures 3 and 5 we use an initial learning rate of 0.1 and in Figure 4 an initial learning rate of 0.01. (See the training-loop sketch below.)
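The data-generation protocol quoted under Open Datasets can be illustrated with a short sketch. Only the overall structure (sample a task function f ∼ D(F), inputs i.i.d. from Dx, and label noise i.i.d. from Dϵ = N(0, σ²)) follows the text; the linear function class, the standard-Gaussian input distribution, and the sample_icl_task helper name are illustrative assumptions, not the paper's exact choices.

```python
import torch

def sample_icl_task(n, d, sigma, generator=None):
    """Sketch of the data-generation protocol: f ~ D(F),
    x_1..x_{n+1} i.i.d. ~ D_x, eps_1..eps_n i.i.d. ~ D_eps = N(0, sigma^2).

    The linear function class and Gaussian inputs below are assumptions
    made for illustration only.
    """
    # f ~ D(F): here an illustrative random linear map w ~ N(0, I_d)
    w = torch.randn(d, generator=generator)
    # x_1, ..., x_{n+1} i.i.d. ~ D_x: here standard Gaussian inputs
    x = torch.randn(n + 1, d, generator=generator)
    # eps_1, ..., eps_n i.i.d. ~ D_eps = N(0, sigma^2)
    eps = sigma * torch.randn(n, generator=generator)
    # Noisy labels for the n in-context examples; clean target for the query
    y_context = x[:n] @ w + eps
    y_query = x[n] @ w
    return x[:n], y_context, x[n], y_query
```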
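The Experiment Setup row translates into a standard PyTorch loop: Adam, an exponentially decaying learning rate with factor 0.999, and one freshly sampled task per round. The sketch below reuses the sample_icl_task helper above and assumes a squared-error objective and a model mapping (context, query) to a scalar prediction; the objective, the model interface, and the train_one_trial name are assumptions rather than details stated in the paper.

```python
import torch

def train_one_trial(model, num_rounds, n, d, sigma, init_lr=0.1):
    """Sketch of the reported optimization setup: Adam with an
    exponentially decaying learning rate (factor 0.999) and one
    freshly sampled task per round.
    """
    optimizer = torch.optim.Adam(model.parameters(), lr=init_lr)
    scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.999)
    losses = []
    for _ in range(num_rounds):
        # One task sampled per round, as stated in the setup
        x_ctx, y_ctx, x_query, y_query = sample_icl_task(n, d, sigma)
        pred = model(x_ctx, y_ctx, x_query)    # assumed model interface
        loss = (pred - y_query).pow(2).mean()  # assumed squared-error objective
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        scheduler.step()  # learning rate decays by a factor of 0.999 each round
        losses.append(loss.item())
    return losses
```

Running this for 10 independent trials (reinitializing the model each time) and plotting the mean and standard deviation of the resulting losses would mirror the reported protocol.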