Linear Transformers are Versatile In-Context Learners
Authors: Max Vladymyrov, Johannes von Oswald, Mark Sandler, Rong Ge
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this paper, we prove that each layer of a linear transformer maintains a weight vector for an implicit linear regression problem and can be interpreted as performing a variant of preconditioned gradient descent. We also investigate the use of linear transformers in a challenging scenario where the training data is corrupted with different levels of noise. Remarkably, we demonstrate that for this problem linear transformers discover an intricate and highly effective optimization algorithm, surpassing or matching in performance many reasonable baselines. Our experiments with two different noise variance distributions (uniform and categorical) demonstrate the remarkable flexibility of linear transformers. |
| Researcher Affiliation | Collaboration | Max Vladymyrov Google Research mxv@google.com Johannes von Oswald Google, Paradigms of Intelligence Team jvoswald@google.com Mark Sandler Google Research sandler@google.com Rong Ge Duke University rongge@cs.duke.edu |
| Pseudocode | No | The paper presents mathematical formulas and descriptions of algorithms but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | We believe that we provide sufficient details on data generation, model variants, training parameters, and evaluation metrics to allow anyone to replicate the core experiments and validate the main claims. The code, while not released at this stage, will be released after the paper's acceptance. |
| Open Datasets | No | As a model problem, we consider data generated from a noisy linear regression model. For each input sequence τ, we sample a ground-truth weight vector wτ ∼ N(0, I), and generate n data points as xi ∼ N(0, I) and yi = ⟨wτ, xi⟩ + ξi, with noise ξi ∼ N(0, στ²). We consider two different problems within the noisy linear regression framework. |
| Dataset Splits | No | The paper generates synthetic data for each sequence, where (xi, yi) form the in-context learning examples and (xt, 0) serves as the test query for that sequence. The evaluation is performed on "novel sequences." There is no explicit mention of a separate validation dataset split from a larger, fixed dataset in the traditional sense. |
| Hardware Specification | Yes | All the experiments were done on a single H100 GPU with 80GB of VRAM. |
| Software Dependencies | No | The paper mentions using the Adam optimizer but does not specify any software libraries or their version numbers (e.g., Python, PyTorch, TensorFlow versions). |
| Experiment Setup | Yes | For each experiment, we train each linear transformer modification with a varying number of layers (1 to 7) using the Adam optimizer for 200,000 iterations with a learning rate of 0.0001 and a batch size of 2,048. In some cases, especially for a large number of layers, we had to adjust the learning rate to prevent stability issues. We report the best result out of 5 runs with different training seeds. |
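The data-generation process quoted above (wτ ∼ N(0, I), xi ∼ N(0, I), yi = ⟨wτ, xi⟩ + ξi with ξi ∼ N(0, στ²)) is simple enough to sketch directly. The snippet below is a minimal NumPy sketch of that sampling procedure, not the authors' released code; the function name `sample_sequence` and its parameters are our own choices for illustration.

```python
import numpy as np

def sample_sequence(n, d, sigma, rng):
    """Sample one noisy linear-regression sequence as described in the paper:
    w ~ N(0, I), x_i ~ N(0, I), y_i = <w, x_i> + xi_i, xi_i ~ N(0, sigma^2).
    Returns the n inputs X (n x d), targets y (n,), and ground-truth weights w (d,)."""
    w = rng.standard_normal(d)          # ground-truth weight vector for this sequence
    X = rng.standard_normal((n, d))     # in-context inputs
    y = X @ w + sigma * rng.standard_normal(n)  # noisy targets
    return X, y, w

rng = np.random.default_rng(0)
X, y, w = sample_sequence(n=20, d=5, sigma=0.5, rng=rng)
```

A fresh wτ and noise level στ are drawn per sequence, so each in-context task is a distinct regression problem; the paper's two settings (uniform and categorical noise-variance distributions) differ only in how στ is sampled.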