Linear Transformers are Versatile In-Context Learners

Authors: Max Vladymyrov, Johannes von Oswald, Mark Sandler, Rong Ge

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this paper, we prove that each layer of a linear transformer maintains a weight vector for an implicit linear regression problem and can be interpreted as performing a variant of preconditioned gradient descent. We also investigate the use of linear transformers in a challenging scenario where the training data is corrupted with different levels of noise. Remarkably, we demonstrate that for this problem linear transformers discover an intricate and highly effective optimization algorithm, surpassing or matching in performance many reasonable baselines. Our experiments with two different noise variance distributions (uniform and categorical) demonstrate the remarkable flexibility of linear transformers. (A numpy sketch of the per-layer update this refers to is given below the table.)
Researcher Affiliation | Collaboration | Max Vladymyrov (Google Research, mxv@google.com); Johannes von Oswald (Google, Paradigms of Intelligence Team, jvoswald@google.com); Mark Sandler (Google Research, sandler@google.com); Rong Ge (Duke University, rongge@cs.duke.edu)
Pseudocode | No | The paper presents mathematical formulas and descriptions of algorithms but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | No | We believe that we provide sufficient details on data generation, model variants, training parameters, and evaluation metrics to allow anyone to replicate the core experiments and validate the main claims. The code, while not released at this stage, will be released after the paper's acceptance.
Open Datasets | No | As a model problem, we consider data generated from a noisy linear regression model. For each input sequence τ, we sample a ground-truth weight vector w_τ ~ N(0, I) and generate n data points as x_i ~ N(0, I) and y_i = ⟨w_τ, x_i⟩ + ξ_i, with noise ξ_i ~ N(0, σ_τ²). We consider two different problems within the noisy linear regression framework. (A data-generation sketch is given below the table.)
Dataset Splits | No | The paper generates synthetic data for each sequence, where the pairs (x_i, y_i) form the in-context learning examples and (x_t, 0) serves as the test query for that sequence. Evaluation is performed on "novel sequences." There is no explicit mention of a separate validation split drawn from a larger, fixed dataset in the traditional sense.
Hardware Specification | Yes | All the experiments were done on a single H100 GPU with 80GB of VRAM.
Software Dependencies | No | The paper mentions using the Adam optimizer but does not specify any software libraries or their version numbers (e.g., Python, PyTorch, TensorFlow versions).
Experiment Setup | Yes | For each experiment, we train each linear transformer modification with a varying number of layers (1 to 7) using the Adam optimizer for 200,000 iterations with a learning rate of 0.0001 and a batch size of 2,048. In some cases, especially for a large number of layers, we had to adjust the learning rate to prevent instability. We report the best result out of 5 runs with different training seeds. (A configuration sketch is given below the table.)
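
The sketches below are not the authors' released code (the "Open Source Code" row notes none is available yet); they are minimal reconstructions of what the rows above describe. First, the "Research Type" row quotes the claim that each linear-transformer layer maintains a weight vector for an implicit linear regression problem and acts as a variant of preconditioned gradient descent. The numpy sketch below shows the standard linear self-attention update e_j ← e_j + (1/n) W_PV E Eᵀ W_KQ e_j on tokens e_i = (x_i, y_i) and checks that one hand-picked (not learned) choice of W_PV and W_KQ reproduces a single gradient-descent step from w = 0; the weight-matrix names and all numerical values are our own assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, eta = 5, 32, 0.5

# Context tokens e_i = (x_i, y_i) as columns; the last column is the query (x_t, 0).
w_true = rng.normal(size=d)
X = rng.normal(size=(d, n))
y = w_true @ X + 0.1 * rng.normal(size=n)
x_t = rng.normal(size=d)
E = np.column_stack([np.vstack([X, y[None, :]]), np.append(x_t, 0.0)])

def linear_attention_layer(E, W_PV, W_KQ, n):
    # One linear self-attention layer: e_j <- e_j + (1/n) * W_PV @ E @ E.T @ W_KQ @ e_j.
    return E + (W_PV @ E @ E.T @ W_KQ @ E) / n

# Hand-picked (not learned) weights: keys/queries read only the x-part of each
# token, while the value/projection path writes only into the y-slot.
W_KQ = np.zeros((d + 1, d + 1))
W_KQ[:d, :d] = np.eye(d)
W_PV = np.zeros((d + 1, d + 1))
W_PV[-1, -1] = eta

E_out = linear_attention_layer(E, W_PV, W_KQ, n)

# One gradient-descent step from w = 0 on L(w) = 1/(2n) * sum_i (y_i - <w, x_i>)^2.
w_gd = (eta / n) * X @ y
print(np.allclose(E_out[-1, -1], w_gd @ x_t))  # True: the query's y-slot holds the GD prediction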
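
Second, the "Open Datasets" and "Dataset Splits" rows describe fully synthetic data: per-sequence w_τ ~ N(0, I), x_i ~ N(0, I), y_i = ⟨w_τ, x_i⟩ + ξ_i with ξ_i ~ N(0, σ_τ²), a query token (x_t, 0), and evaluation on novel sequences rather than fixed splits. The generator below is a hedged reconstruction of that description; the dimension, sequence length, and the exact parameters of the uniform and categorical noise-variance distributions are placeholders, not values taken from the paper.

```python
import numpy as np

def sample_sequence(rng, d=8, n=16, noise="uniform", sigma_max=2.0,
                    categories=(0.1, 1.0, 2.0)):
    # One in-context regression task: n labelled examples plus a query token.
    # Returns tokens of shape (n + 1, d + 1); the last token is (x_t, 0), and the
    # held-out target <w_tau, x_t> is returned separately for evaluation.
    w_tau = rng.normal(size=d)                    # ground-truth weights, w_tau ~ N(0, I)
    if noise == "uniform":                        # sigma_tau ~ U(0, sigma_max)  (assumed form)
        sigma_tau = rng.uniform(0.0, sigma_max)
    else:                                         # categorical over a fixed set (assumed form)
        sigma_tau = rng.choice(categories)
    X = rng.normal(size=(n, d))                   # x_i ~ N(0, I)
    y = X @ w_tau + sigma_tau * rng.normal(size=n)  # y_i = <w_tau, x_i> + xi_i
    x_t = rng.normal(size=d)
    tokens = np.zeros((n + 1, d + 1))
    tokens[:n, :d], tokens[:n, d] = X, y          # in-context examples (x_i, y_i)
    tokens[n, :d] = x_t                           # query token (x_t, 0)
    return tokens, w_tau @ x_t

rng = np.random.default_rng(0)
batch = [sample_sequence(rng) for _ in range(4)]  # fresh sequences every batch; no fixed splits
```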
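
Finally, the "Experiment Setup" row pins down the optimizer, iteration count, learning rate, batch size, depth sweep, and seed protocol. The configuration sketch below simply records those numbers; the dataclass and field names are our own scaffolding, not the authors' training code.

```python
from dataclasses import dataclass

@dataclass
class TrainConfig:
    num_layers: int               # swept from 1 to 7
    optimizer: str = "adam"
    learning_rate: float = 1e-4   # reduced in some deep runs to avoid instability
    iterations: int = 200_000
    batch_size: int = 2_048
    num_seeds: int = 5            # best of 5 training seeds is reported

sweep = [TrainConfig(num_layers=k) for k in range(1, 8)]
```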