Linear Transformers are Versatile In-Context Learners

Authors: Max Vladymyrov, Johannes von Oswald, Mark Sandler, Rong Ge

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this paper, we prove that each layer of a linear transformer maintains a weight vector for an implicit linear regression problem and can be interpreted as performing a variant of preconditioned gradient descent. We also investigate the use of linear transformers in a challenging scenario where the training data is corrupted with different levels of noise. Remarkably, we demonstrate that for this problem linear transformers discover an intricate and highly effective optimization algorithm, surpassing or matching in performance many reasonable baselines. Our experiments with two different noise variance distributions (uniform and categorical) demonstrate the remarkable flexibility of linear transformers. (A numpy sketch of the per-layer update this refers to is given below the table.)
Researcher Affiliation | Collaboration | Max Vladymyrov (Google Research, mxv@google.com); Johannes von Oswald (Google, Paradigms of Intelligence Team, jvoswald@google.com); Mark Sandler (Google Research, sandler@google.com); Rong Ge (Duke University, rongge@cs.duke.edu)
Pseudocode | No | The paper presents mathematical formulas and descriptions of algorithms but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | No | We believe that we provide sufficient details on data generation, model variants, training parameters, and evaluation metrics to allow anyone to replicate the core experiments and validate the main claims. The code, while not released at this stage, will be released after the paper's acceptance.
Open Datasets | No | As a model problem, we consider data generated from a noisy linear regression model. For each input sequence τ, we sample a ground-truth weight vector w_τ ~ N(0, I) and generate n data points as x_i ~ N(0, I) and y_i = ⟨w_τ, x_i⟩ + ξ_i, with noise ξ_i ~ N(0, σ_τ²). We consider two different problems within the noisy linear regression framework. (A data-generation sketch is given below the table.)
Dataset Splits | No | The paper generates synthetic data for each sequence, where the pairs (x_i, y_i) form the in-context learning examples and (x_t, 0) serves as the test query for that sequence. Evaluation is performed on "novel sequences." There is no explicit mention of a separate validation split drawn from a larger, fixed dataset in the traditional sense.
Hardware Specification | Yes | All the experiments were done on a single H100 GPU with 80GB of VRAM.
Software Dependencies | No | The paper mentions using the Adam optimizer but does not specify any software libraries or their version numbers (e.g., Python, PyTorch, TensorFlow versions).
Experiment Setup | Yes | For each experiment, we train each linear transformer modification with a varying number of layers (1 to 7) using the Adam optimizer for 200,000 iterations with a learning rate of 0.0001 and a batch size of 2,048. In some cases, especially for a large number of layers, we had to adjust the learning rate to prevent instability. We report the best result out of 5 runs with different training seeds. (A configuration sketch is given below the table.)
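
The sketches below are not the authors' released code (the "Open Source Code" row notes none is available yet); they are minimal reconstructions of what the rows above describe. First, the "Research Type" row quotes the claim that each linear-transformer layer maintains a weight vector for an implicit linear regression problem and acts as a variant of preconditioned gradient descent. The numpy sketch below shows the standard linear self-attention update e_j ← e_j + (1/n) W_PV E Eᵀ W_KQ e_j on tokens e_i = (x_i, y_i) and checks that one hand-picked (not learned) choice of W_PV and W_KQ reproduces a single gradient-descent step from w = 0; the weight-matrix names and all numerical values are our own assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, eta = 5, 32, 0.5

# Context tokens e_i = (x_i, y_i) as columns; the last column is the query (x_t, 0).
w_true = rng.normal(size=d)
X = rng.normal(size=(d, n))
y = w_true @ X + 0.1 * rng.normal(size=n)
x_t = rng.normal(size=d)
E = np.column_stack([np.vstack([X, y[None, :]]), np.append(x_t, 0.0)])

def linear_attention_layer(E, W_PV, W_KQ, n):
    # One linear self-attention layer: e_j <- e_j + (1/n) * W_PV @ E @ E.T @ W_KQ @ e_j.
    return E + (W_PV @ E @ E.T @ W_KQ @ E) / n

# Hand-picked (not learned) weights: keys/queries read only the x-part of each
# token, while the value/projection path writes only into the y-slot.
W_KQ = np.zeros((d + 1, d + 1))
W_KQ[:d, :d] = np.eye(d)
W_PV = np.zeros((d + 1, d + 1))
W_PV[-1, -1] = eta

E_out = linear_attention_layer(E, W_PV, W_KQ, n)

# One gradient-descent step from w = 0 on L(w) = 1/(2n) * sum_i (y_i - <w, x_i>)^2.
w_gd = (eta / n) * X @ y
print(np.allclose(E_out[-1, -1], w_gd @ x_t))  # True: the query's y-slot holds the GD prediction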
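
Second, the "Open Datasets" and "Dataset Splits" rows describe fully synthetic data: per-sequence w_τ ~ N(0, I), x_i ~ N(0, I), y_i = ⟨w_τ, x_i⟩ + ξ_i with ξ_i ~ N(0, σ_τ²), a query token (x_t, 0), and evaluation on novel sequences rather than fixed splits. The generator below is a hedged reconstruction of that description; the dimension, sequence length, and the exact parameters of the uniform and categorical noise-variance distributions are placeholders, not values taken from the paper.

```python
import numpy as np

def sample_sequence(rng, d=8, n=16, noise="uniform", sigma_max=2.0,
                    categories=(0.1, 1.0, 2.0)):
    # One in-context regression task: n labelled examples plus a query token.
    # Returns tokens of shape (n + 1, d + 1); the last token is (x_t, 0), and the
    # held-out target <w_tau, x_t> is returned separately for evaluation.
    w_tau = rng.normal(size=d)                    # ground-truth weights, w_tau ~ N(0, I)
    if noise == "uniform":                        # sigma_tau ~ U(0, sigma_max)  (assumed form)
        sigma_tau = rng.uniform(0.0, sigma_max)
    else:                                         # categorical over a fixed set (assumed form)
        sigma_tau = rng.choice(categories)
    X = rng.normal(size=(n, d))                   # x_i ~ N(0, I)
    y = X @ w_tau + sigma_tau * rng.normal(size=n)  # y_i = <w_tau, x_i> + xi_i
    x_t = rng.normal(size=d)
    tokens = np.zeros((n + 1, d + 1))
    tokens[:n, :d], tokens[:n, d] = X, y          # in-context examples (x_i, y_i)
    tokens[n, :d] = x_t                           # query token (x_t, 0)
    return tokens, w_tau @ x_t

rng = np.random.default_rng(0)
batch = [sample_sequence(rng) for _ in range(4)]  # fresh sequences every batch; no fixed splits
```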
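
Finally, the "Experiment Setup" row pins down the optimizer, iteration count, learning rate, batch size, depth sweep, and seed protocol. The configuration sketch below simply records those numbers; the dataclass and field names are our own scaffolding, not the authors' training code.

```python
from dataclasses import dataclass

@dataclass
class TrainConfig:
    num_layers: int               # swept from 1 to 7
    optimizer: str = "adam"
    learning_rate: float = 1e-4   # reduced in some deep runs to avoid instability
    iterations: int = 200_000
    batch_size: int = 2_048
    num_seeds: int = 5            # best of 5 training seeds is reported

sweep = [TrainConfig(num_layers=k) for k in range(1, 8)]
```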