Transformers as Algorithms: Generalization and Stability in In-context Learning
Authors: Yingcong Li, Muhammed Emrullah Ildiz, Dimitris Papailiopoulos, Samet Oymak
ICML 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Finally, we provide numerical evaluations that (1) demonstrate transformers can indeed implement near-optimal algorithms on classical regression problems with i.i.d. and dynamic data, (2) provide insights on stability, and (3) verify our theoretical predictions. |
| Researcher Affiliation | Academia | 1{yli692,mildi001}@ucr.edu, University of California, Riverside. 2dimitris@papail.io, University of Wisconsin, Madison. 3University of Michigan, Ann Arbor. Correspondence to: Samet Oymak <oymak@umich.edu>. |
| Pseudocode | No | The paper describes algorithms and methods in prose and mathematical notation but does not contain any formal pseudocode blocks or algorithm listings. |
| Open Source Code | Yes | Our code is available at https://github.com/yingcong-li/transformers-as-algorithms. |
| Open Datasets | No | The paper describes using synthetic or generated datasets for experiments (e.g., 'random linear regression tasks', 'linear data with covariance prior', 'partially-observed LDS') and refers to code for generation, but does not provide specific access information (links, DOIs, citations to publicly available versions) for these datasets themselves. |
| Dataset Splits | No | The paper does not specify explicit training, validation, or test dataset split percentages or sample counts. It refers to training and evaluation on tasks, but not in terms of common data splits for reproducibility. |
| Hardware Specification | No | The paper describes the GPT-2 architecture used but does not provide specific hardware details such as GPU/CPU models, memory, or computational resources used for training or evaluation. |
| Software Dependencies | No | The paper mentions 'Python 3.8' and 'Adam optimizer' but does not list specific version numbers for other key software libraries or frameworks (e.g., PyTorch, TensorFlow) that would be necessary for reproduction. |
| Experiment Setup | Yes | All experiments use learning rate 0.0001 and Adam optimizer. For Fig. 2(c) and Fig. 11, we fix the batch size to 64 and train with 500k/100k iterations. |
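
For concreteness, the sketch below shows how the reported setup could be reconstructed from the details quoted in the table alone: Adam with learning rate 1e-4, batch size 64, and freshly sampled random linear regression tasks at each iteration. `TinyICLModel` and `sample_linear_regression_batch` are hypothetical stand-ins, not the authors' implementation; the paper trains a GPT-2 architecture, and the released code is in the repository linked above.

```python
import torch


def sample_linear_regression_batch(batch_size=64, n_points=20, dim=10):
    """One random linear regression task per sequence: y_i = <w, x_i>."""
    w = torch.randn(batch_size, dim, 1)           # fresh task weights per sequence
    x = torch.randn(batch_size, n_points, dim)    # in-context inputs
    y = (x @ w).squeeze(-1)                       # noiseless labels for simplicity
    return x, y


class TinyICLModel(torch.nn.Module):
    """Small causal transformer standing in for the paper's GPT-2 backbone."""

    def __init__(self, dim, hidden=64, layers=2, heads=4):
        super().__init__()
        self.embed = torch.nn.Linear(dim + 1, hidden)  # embed (x_t, y_{t-1}) tokens
        enc_layer = torch.nn.TransformerEncoderLayer(
            d_model=hidden, nhead=heads, batch_first=True)
        self.encoder = torch.nn.TransformerEncoder(enc_layer, num_layers=layers)
        self.head = torch.nn.Linear(hidden, 1)

    def forward(self, x, y):
        # Predict y_t from the prompt (x_1, y_1, ..., x_{t-1}, y_{t-1}, x_t).
        y_prev = torch.cat([torch.zeros_like(y[:, :1]), y[:, :-1]], dim=1)
        tokens = self.embed(torch.cat([x, y_prev.unsqueeze(-1)], dim=-1))
        n = x.shape[1]
        causal_mask = torch.triu(torch.full((n, n), float("-inf")), diagonal=1)
        return self.head(self.encoder(tokens, mask=causal_mask)).squeeze(-1)


dim = 10
model = TinyICLModel(dim)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # settings from the table

for step in range(1000):  # the paper reports 500k/100k iterations
    x, y = sample_linear_regression_batch(batch_size=64, dim=dim)
    loss = torch.mean((model(x, y) - y) ** 2)  # average in-context squared error
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Shifting the labels by one position lets the model predict each y_t only from the examples seen earlier in the prompt, which matches the in-context prediction setting the paper studies; architecture size, context length, and noise level here are illustrative choices, not values taken from the paper.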