Looped Transformers are Better at Learning Learning Algorithms
Authors: Liu Yang, Kangwook Lee, Robert D. Nowak, Dimitris Papailiopoulos
ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results suggest that the looped transformer achieves performance comparable to the standard transformer in solving various data-fitting problems, while utilizing less than 10% of the parameter count. |
| Researcher Affiliation | Academia | Liu Yang, Kangwook Lee, Robert D. Nowak & Dimitris Papailiopoulos, University of Wisconsin, Madison, USA; {liu.yang, kangwook.lee, rdnowak}@wisc.edu, dimitris@papail.io |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our code is available at https://github.com/Leiay/looped_transformer. |
| Open Datasets | Yes | we have conducted additional experiments using 10 datasets from OpenML (Vanschoren et al., 2013) |
| Dataset Splits | Yes | During training, we uniformly sampled prompts from 9 datasets, where for each prompt, we first randomly selected a training set, then randomly selected k + 1 samples from this training set, with k being the number of in-context samples. During testing, we applied a similar approach for each test sample, selecting k in-context samples from the test dataset, with care taken to exclude the test sample itself from these in-context pairs. |
| Hardware Specification | No | The paper does not describe the hardware used to run its experiments; no specific GPU/CPU models or specifications are given. |
| Software Dependencies | No | The paper mentions using 'Adam optimizer' and 'GPT-2 decoder model' but does not specify versions for programming languages, libraries, or frameworks (e.g., Python, PyTorch, TensorFlow). |
| Experiment Setup | Yes | Specifically, we employ a GPT-2 model with an embedding dimension of D = 256 and h = 8 attention heads. The standard (unlooped) transformer has L = 12 layers, and the looped transformer has L = 1 layer. ... train with Adam optimizer, learning rate 0.0001, no weight decay or other explicit regularization... we adopt b = 20 and T = 15 for the linear regression task. |
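The Experiment Setup row contrasts a standard 12-layer GPT-2 (D = 256, h = 8) with a 1-layer looped variant, which is also why the Research Type row can report under 10% of the parameter count. As a rough illustration only, here is a minimal PyTorch sketch, not the authors' code: a stack of L = 12 distinct blocks versus a single weight-tied block applied for T loop iterations. The block internals are simplified, input injection is one common choice for weight-tied models rather than a quoted detail, and the exact roles of the paper's b and T hyperparameters are not reproduced here.

```python
import torch
import torch.nn as nn

D, H, L, T = 256, 8, 12, 15  # embed dim, heads, unlooped depth, loop iterations


def make_block() -> nn.Module:
    # GPT-2-style decoder block approximated by a pre-norm encoder layer
    # driven with a causal mask; the real architecture differs in details.
    return nn.TransformerEncoderLayer(
        d_model=D, nhead=H, dim_feedforward=4 * D,
        batch_first=True, norm_first=True,
    )


class StandardTransformer(nn.Module):
    """L distinct layers, each applied once (the unlooped baseline)."""

    def __init__(self):
        super().__init__()
        self.layers = nn.ModuleList(make_block() for _ in range(L))

    def forward(self, x, mask):
        for layer in self.layers:
            x = layer(x, src_mask=mask)
        return x


class LoopedTransformer(nn.Module):
    """One block whose weights are reused for every loop iteration.

    The original input is re-added at each iteration ("input injection"),
    a common choice for weight-tied models; the paper's exact wiring may differ.
    """

    def __init__(self):
        super().__init__()
        self.layer = make_block()

    def forward(self, x, mask, n_loops: int = T):
        h = torch.zeros_like(x)
        for _ in range(n_loops):
            h = self.layer(h + x, src_mask=mask)  # same weights every pass
        return h


if __name__ == "__main__":
    seq_len = 32
    causal = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)
    x = torch.randn(2, seq_len, D)

    std, looped = StandardTransformer(), LoopedTransformer()
    n_std = sum(p.numel() for p in std.parameters())
    n_loop = sum(p.numel() for p in looped.parameters())
    print(f"standard stack: {n_std:,} params | looped: {n_loop:,} params "
          f"({100 * n_loop / n_std:.1f}% of the stack, embeddings excluded)")
    print(std(x, causal).shape, looped(x, causal).shape)
```

With one block reused instead of twelve distinct ones, the transformer-block parameter count drops to roughly 1/12, consistent with the "less than 10%" figure quoted above (embeddings and readout are excluded from this toy comparison).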
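Similarly, the Dataset Splits row describes how in-context prompts are drawn from the OpenML datasets. The sketch below is a hedged reading of that quoted procedure, not the authors' pipeline: the helper names and the synthetic stand-in data are hypothetical. During training it uniformly picks one of the 9 training datasets and draws k + 1 rows (k in-context pairs plus one query); at test time it draws the k in-context pairs from the test set while excluding the query row itself.

```python
import numpy as np


def sample_training_prompt(train_datasets, k, rng):
    """Uniformly pick one training dataset, then k in-context pairs + 1 query."""
    X, y = train_datasets[rng.integers(len(train_datasets))]
    idx = rng.choice(len(X), size=k + 1, replace=False)
    ctx, query = idx[:k], idx[k]
    return (X[ctx], y[ctx]), (X[query], y[query])


def sample_test_prompt(X_test, y_test, query_idx, k, rng):
    """Draw k in-context pairs from the test set, excluding the query row itself."""
    candidates = np.setdiff1d(np.arange(len(X_test)), [query_idx])
    ctx = rng.choice(candidates, size=k, replace=False)
    return (X_test[ctx], y_test[ctx]), (X_test[query_idx], y_test[query_idx])


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic stand-ins shaped like small tabular OpenML tasks: 9 (X, y) datasets.
    train_datasets = [(rng.normal(size=(100, 20)), rng.normal(size=100)) for _ in range(9)]
    (ctx_x, ctx_y), (qx, qy) = sample_training_prompt(train_datasets, k=40, rng=rng)
    print(ctx_x.shape, ctx_y.shape, qx.shape)
```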