Transformers as Algorithms: Generalization and Stability in In-context Learning

Authors: Yingcong Li, Muhammed Emrullah Ildiz, Dimitris Papailiopoulos, Samet Oymak

ICML 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Finally, we provide numerical evaluations that (1) demonstrate transformers can indeed implement near-optimal algorithms on classical regression problems with i.i.d. and dynamic data, (2) provide insights on stability, and (3) verify our theoretical predictions.
Researcher Affiliation | Academia | 1{yli692,mildi001}@ucr.edu, University of California, Riverside. 2dimitris@papail.io, University of Wisconsin, Madison. 3University of Michigan, Ann Arbor. Correspondence to: Samet Oymak <oymak@umich.edu>.
Pseudocode | No | The paper describes algorithms and methods in prose and mathematical notation but does not contain any formal pseudocode blocks or algorithm listings.
Open Source Code | Yes | Our code is available at https://github.com/yingcong-li/transformers-as-algorithms.
Open Datasets | No | The paper describes using synthetic or generated datasets for its experiments (e.g., 'random linear regression tasks', 'linear data with covariance prior', 'partially-observed LDS') and refers to the code for data generation, but does not provide access information (links, DOIs, or citations to publicly available versions) for these datasets.
Dataset Splits | No | The paper does not specify explicit training, validation, or test split percentages or sample counts. It refers to training and evaluation on tasks, but not in terms of the standard data splits needed for reproducibility.
Hardware Specification | No | The paper describes the GPT-2 architecture used but does not provide hardware details such as GPU/CPU models, memory, or the computational resources used for training and evaluation.
Software Dependencies | No | The paper mentions 'Python 3.8' and the Adam optimizer but does not list version numbers for other key software libraries or frameworks (e.g., PyTorch, TensorFlow) that would be needed for reproduction.
Experiment Setup | Yes | All experiments use learning rate 0.0001 and Adam optimizer. For Fig. 2(c) and Fig. 11, we fix the batch size to 64 and train with 500k/100k iterations.
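
The reported hyperparameters translate into a short training loop. Below is a minimal sketch, assuming a PyTorch setup: the placeholder model, problem dimension, and synthetic linear-regression generator are illustrative assumptions, while the Adam optimizer, learning rate 0.0001, and batch size 64 come from the row above. The paper itself trains a GPT-2 style transformer on sequences of (x, y) pairs for the reported 500k/100k iterations.

```python
# Minimal sketch of the reported optimization setup (Adam, lr = 0.0001,
# batch size 64). The model and data generator below are illustrative
# placeholders, not the paper's GPT-2 in-context learning pipeline.
import torch
import torch.nn as nn

DIM, BATCH_SIZE, LR = 10, 64, 1e-4  # DIM is an assumed problem dimension

def sample_linear_regression_batch(batch_size=BATCH_SIZE, dim=DIM):
    """Random linear regression tasks: y = <x, w> with a fresh w per example."""
    w = torch.randn(batch_size, dim)
    x = torch.randn(batch_size, dim)
    y = (x * w).sum(dim=-1, keepdim=True)
    return x, y

# Placeholder model; the paper trains a GPT-2 style transformer instead.
model = nn.Sequential(nn.Linear(DIM, 64), nn.ReLU(), nn.Linear(64, 1))
optimizer = torch.optim.Adam(model.parameters(), lr=LR)  # lr 0.0001 as reported
criterion = nn.MSELoss()

for step in range(1000):  # the paper reports 500k/100k training iterations
    x, y = sample_linear_regression_batch()  # batch size 64 as reported
    loss = criterion(model(x), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```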