Transformers as Statisticians: Provable In-Context Learning with In-Context Algorithm Selection
Authors: Yu Bai, Fan Chen, Huan Wang, Caiming Xiong, Song Mei
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | This work first provides a comprehensive statistical theory for transformers to perform ICL. Concretely, we show that transformers can implement a broad class of standard machine learning algorithms in context, such as least squares, ridge regression, Lasso, learning generalized linear models, and gradient descent on two-layer neural networks, with near-optimal predictive power on various in-context data distributions. ... We both establish this in theory by explicit constructions, and also observe this phenomenon experimentally. ... Experimentally, we demonstrate the strong in-context algorithm selection capabilities of standard transformer architectures. (See the in-context ridge-regression sketch after the table.) |
| Researcher Affiliation | Collaboration | Yu Bai (Salesforce Research, yu.bai@salesforce.com); Fan Chen (Massachusetts Institute of Technology, fanchen@mit.edu); Huan Wang (Salesforce Research, huan.wang@salesforce.com); Caiming Xiong (Salesforce Research, cxiong@salesforce.com); Song Mei (UC Berkeley, songmei@berkeley.edu) |
| Pseudocode | No | No pseudocode or algorithm blocks were found. |
| Open Source Code | Yes | Code is available at https://github.com/allenbai01/transformers-as-statisticians. |
| Open Datasets | No | The paper describes generating synthetic data: 'we sample the training instances from one of the following base distributions (tasks), where we first sample P = P_w ~ π by sampling w ~ N(0, I_d/d), and then sample {(x_i, y_i)}_{i∈[N+1]} iid from P_w'. No link or citation to a publicly available dataset of the generated data is provided. (A data-generation sketch appears after the table.) |
| Dataset Splits | No | The paper discusses a 'train-validation split' in the context of their algorithm (Post-ICL validation mechanism), but does not specify fixed percentages or sample counts for an overall train/validation split of the data used to train their transformer model. |
| Hardware Specification | Yes | All our experiments are performed on 8 Nvidia Tesla A100 GPUs (40GB memory). |
| Software Dependencies | No | The paper mentions 'Adam optimizer' and 'GPT-2' but does not provide specific version numbers for any software dependencies. |
| Experiment Setup | Yes | We train a 12-layer transformer, with two modes for the training sequence (instance) distribution π. In the base mode... we use the square loss when P is regression data, and the logistic loss when P is classification data. We use the Adam optimizer with a fixed learning rate 10^-4... for 300K steps, where each step consists of a (fresh) minibatch with batch size 64 in the base mode, and K minibatches each with batch size 64 in the mixture mode. (A configuration sketch appears after the table.) |
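
The synthetic data-generation procedure quoted in the Open Datasets row can be illustrated with a minimal sketch. Only the sampling of w ~ N(0, I_d/d) and the i.i.d. draw of {(x_i, y_i)} from P_w come from the quoted text; the linear-regression form of P_w, the Gaussian covariates, the noise level `sigma`, and the function name `sample_icl_instance` are illustrative assumptions, not the authors' released code.

```python
import numpy as np

def sample_icl_instance(d=20, N=40, sigma=0.0, rng=None):
    """Sample one in-context instance {(x_i, y_i)}_{i in [N+1]} from P_w.

    Sketch of the quoted procedure: draw w ~ N(0, I_d / d), then draw
    (x_i, y_i) i.i.d. from P_w. The linear-regression label model and the
    optional noise level `sigma` are assumptions for illustration only.
    """
    rng = rng or np.random.default_rng()
    w = rng.normal(scale=np.sqrt(1.0 / d), size=d)    # w ~ N(0, I_d / d)
    x = rng.normal(size=(N + 1, d))                   # x_i ~ N(0, I_d) (assumed)
    y = x @ w + sigma * rng.normal(size=N + 1)        # y_i = <w, x_i> (+ optional noise)
    return x, y

# The first N pairs form the in-context examples; (x_{N+1}, y_{N+1}) is the query.
xs, ys = sample_icl_instance()
```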
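The Research Type row quotes the paper's claim that transformers can implement standard estimators such as least squares and ridge regression in context. For reference, here is a minimal sketch of such an in-context ridge baseline applied to one sampled instance; the regularization value `lam = 0.1` and the function name are assumptions, not the paper's construction.

```python
import numpy as np

def ridge_in_context(xs, ys, x_query, lam=0.1):
    """Fit ridge regression on the N in-context examples and predict at the
    query point; `lam` is an assumed regularization strength."""
    d = xs.shape[1]
    w_hat = np.linalg.solve(xs.T @ xs + lam * np.eye(d), xs.T @ ys)
    return x_query @ w_hat

# Use the first N pairs (from the sketch above) as context, the last as the query.
pred = ridge_in_context(xs[:-1], ys[:-1], xs[-1])
```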
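The Experiment Setup row can be condensed into a configuration sketch. Only the quoted numbers (12 layers, Adam with fixed learning rate 10^-4, 300K steps, batch size 64, and K minibatches per step in the mixture mode) come from the paper; the variable names and layout are assumptions.

```python
# Hyperparameters quoted from the paper's experiment setup; names are assumed.
config = {
    "n_layers": 12,          # 12-layer transformer (GPT-2 style backbone)
    "optimizer": "Adam",
    "learning_rate": 1e-4,   # fixed learning rate 10^-4
    "train_steps": 300_000,  # 300K steps
    "batch_size": 64,        # per minibatch
}

def minibatches_per_step(mode, K=1):
    """Base mode: one fresh minibatch per step; mixture mode: K minibatches
    per step, as described in the quoted setup."""
    return 1 if mode == "base" else K
```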