Transformers as Statisticians: Provable In-Context Learning with In-Context Algorithm Selection
Authors: Yu Bai, Fan Chen, Huan Wang, Caiming Xiong, Song Mei
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | This work first provides a comprehensive statistical theory for transformers to perform ICL. Concretely, we show that transformers can implement a broad class of standard machine learning algorithms in context, such as least squares, ridge regression, Lasso, learning generalized linear models, and gradient descent on two-layer neural networks, with near-optimal predictive power on various in-context data distributions. ... We both establish this in theory by explicit constructions, and also observe this phenomenon experimentally. ... Experimentally, we demonstrate the strong in-context algorithm selection capabilities of standard transformer architectures. (See the in-context ridge-regression sketch after the table.) |
| Researcher Affiliation | Collaboration | Yu Bai (Salesforce Research, yu.bai@salesforce.com); Fan Chen (Massachusetts Institute of Technology, fanchen@mit.edu); Huan Wang (Salesforce Research, huan.wang@salesforce.com); Caiming Xiong (Salesforce Research, cxiong@salesforce.com); Song Mei (UC Berkeley, songmei@berkeley.edu) |
| Pseudocode | No | No pseudocode or algorithm blocks were found. |
| Open Source Code | Yes | Code is available at https://github.com/allenbai01/transformers-as-statisticians. |
| Open Datasets | No | The paper describes generating synthetic data: 'we sample the training instances from one of the following base distributions (tasks), where we first sample P = P_w ~ π by sampling w ~ N(0, I_d/d), and then sample {(x_i, y_i)}_{i∈[N+1]} iid from P_w'. No link or citation to a publicly available dataset of the generated data is provided. (A data-generation sketch appears after the table.) |
| Dataset Splits | No | The paper discusses a 'train-validation split' in the context of their algorithm (Post-ICL validation mechanism), but does not specify fixed percentages or sample counts for an overall train/validation split of the data used to train their transformer model. |
| Hardware Specification | Yes | All our experiments are performed on 8 Nvidia Tesla A100 GPUs (40GB memory). |
| Software Dependencies | No | The paper mentions 'Adam optimizer' and 'GPT-2' but does not provide specific version numbers for any software dependencies. |
| Experiment Setup | Yes | We train a 12-layer transformer, with two modes for the training sequence (instance) distribution π. In the base mode... we use the square loss when P is regression data, and the logistic loss when P is classification data. We use the Adam optimizer with a fixed learning rate 10^-4... for 300K steps, where each step consists of a (fresh) minibatch with batch size 64 in the base mode, and K minibatches each with batch size 64 in the mixture mode. (A configuration sketch appears after the table.) |
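
The synthetic data-generation procedure quoted in the Open Datasets row can be illustrated with a minimal sketch. Only the sampling of w ~ N(0, I_d/d) and the i.i.d. draw of {(x_i, y_i)} from P_w come from the quoted text; the linear-regression form of P_w, the Gaussian covariates, the noise level `sigma`, and the function name `sample_icl_instance` are illustrative assumptions, not the authors' released code.

```python
import numpy as np

def sample_icl_instance(d=20, N=40, sigma=0.0, rng=None):
    """Sample one in-context instance {(x_i, y_i)}_{i in [N+1]} from P_w.

    Sketch of the quoted procedure: draw w ~ N(0, I_d / d), then draw
    (x_i, y_i) i.i.d. from P_w. The linear-regression label model and the
    optional noise level `sigma` are assumptions for illustration only.
    """
    rng = rng or np.random.default_rng()
    w = rng.normal(scale=np.sqrt(1.0 / d), size=d)    # w ~ N(0, I_d / d)
    x = rng.normal(size=(N + 1, d))                   # x_i ~ N(0, I_d) (assumed)
    y = x @ w + sigma * rng.normal(size=N + 1)        # y_i = <w, x_i> (+ optional noise)
    return x, y

# The first N pairs form the in-context examples; (x_{N+1}, y_{N+1}) is the query.
xs, ys = sample_icl_instance()
```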
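The Research Type row quotes the paper's claim that transformers can implement standard estimators such as least squares and ridge regression in context. For reference, here is a minimal sketch of such an in-context ridge baseline applied to one sampled instance; the regularization value `lam = 0.1` and the function name are assumptions, not the paper's construction.

```python
import numpy as np

def ridge_in_context(xs, ys, x_query, lam=0.1):
    """Fit ridge regression on the N in-context examples and predict at the
    query point; `lam` is an assumed regularization strength."""
    d = xs.shape[1]
    w_hat = np.linalg.solve(xs.T @ xs + lam * np.eye(d), xs.T @ ys)
    return x_query @ w_hat

# Use the first N pairs (from the sketch above) as context, the last as the query.
pred = ridge_in_context(xs[:-1], ys[:-1], xs[-1])
```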
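The Experiment Setup row can be condensed into a configuration sketch. Only the quoted numbers (12 layers, Adam with fixed learning rate 10^-4, 300K steps, batch size 64, and K minibatches per step in the mixture mode) come from the paper; the variable names and layout are assumptions.

```python
# Hyperparameters quoted from the paper's experiment setup; names are assumed.
config = {
    "n_layers": 12,          # 12-layer transformer (GPT-2 style backbone)
    "optimizer": "Adam",
    "learning_rate": 1e-4,   # fixed learning rate 10^-4
    "train_steps": 300_000,  # 300K steps
    "batch_size": 64,        # per minibatch
}

def minibatches_per_step(mode, K=1):
    """Base mode: one fresh minibatch per step; mixture mode: K minibatches
    per step, as described in the quoted setup."""
    return 1 if mode == "base" else K
```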