How do Transformers Perform In-Context Autoregressive Learning?
Authors: Michael Eli Sander, Raja Giryes, Taiji Suzuki, Mathieu Blondel, Gabriel Peyré
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | On the experimental side, we consider the general case of non-commuting orthogonal matrices and generalize our theoretical findings. |
| Researcher Affiliation | Collaboration | 1 École Normale Supérieure and CNRS, France; 2 Tel Aviv University, Israel; 3 University of Tokyo and RIKEN AIP, Japan; 4 Google DeepMind. |
| Pseudocode | No | The paper does not include any clearly labeled 'Pseudocode' or 'Algorithm' blocks or figures. Methods are described in prose and mathematical formulations. |
| Open Source Code | Yes | Our code in Pytorch (Paszke et al., 2017) and JAX (Bradbury et al., 2018) is open-sourced at https://github.com/michaelsdr/ical. |
| Open Datasets | Yes | We use the nltk package (Bird et al., 2009), and we employ classic literary works, specifically Moby Dick by Herman Melville sourced from Project Gutenberg. (A data-loading sketch follows the table.) |
| Dataset Splits | No | The paper mentions generating a 'training' dataset and using 'another dataset' for testing, but it does not explicitly state the use of a 'validation' set or provide specific train/validation/test split percentages. For example: 'We generate a dataset with n = 2^14 sequences with T_max = 50 and d = 5 (therefore e_t ∈ R^15) for training. We test using another dataset with 2^10 sequences of the same shape.' (A data-generation sketch follows the table.) |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU or CPU models. It mentions training on 'a full Transformer' and discusses 'High Performance Computing Resource' in a general sense, but no concrete specifications are given. |
| Software Dependencies | No | The paper mentions software like Pytorch, JAX, and nltk, but it does not specify any version numbers for these or other software dependencies. |
| Experiment Setup | Yes | We train for 2000 epochs with Adam (Kingma & Ba, 2014) and a learning rate of 5 × 10^-3 to minimize the mean squared error (MSE)... (A training-loop sketch follows the table.) |
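
The Open Datasets row mentions sourcing Moby Dick through nltk. Below is a minimal, hedged sketch of how that text can be obtained from nltk's bundled Project Gutenberg corpus (which includes `melville-moby_dick.txt`); it illustrates the data source only and is not the authors' preprocessing pipeline.

```python
# Minimal sketch (not the authors' pipeline): loading Moby Dick through
# nltk's bundled Project Gutenberg corpus, which ships 'melville-moby_dick.txt'.
import nltk

nltk.download("gutenberg", quiet=True)      # fetch the corpus if not already present
from nltk.corpus import gutenberg

raw_text = gutenberg.raw("melville-moby_dick.txt")    # full text as one string
words = gutenberg.words("melville-moby_dick.txt")     # pre-tokenized word list
print(len(words), list(words[:8]))
```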
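
For the Dataset Splits row, the quoted setup (n = 2^14 training sequences, T_max = 50, d = 5, and a 2^10-sequence test set) could be generated roughly as follows. This is a hedged reconstruction: the recursion s_{t+1} = W s_t with a per-sequence random orthogonal W, the unit-norm initialization, and the helper names `random_orthogonal` / `make_dataset` are illustrative assumptions, and the paper's exact token construction (which yields e_t ∈ R^15 for d = 5) is not reproduced here.

```python
# Hedged sketch of a synthetic dataset in the spirit of the quoted setup:
# n sequences of length T_max in dimension d, each produced by iterating a
# per-sequence random orthogonal matrix W (s_{t+1} = W s_t). The paper's
# exact token augmentation is an assumption we do not reproduce.
import torch

def random_orthogonal(d: int) -> torch.Tensor:
    # QR decomposition of a Gaussian matrix gives a random orthogonal W.
    q, _ = torch.linalg.qr(torch.randn(d, d))
    return q

def make_dataset(n: int, t_max: int, d: int) -> torch.Tensor:
    seqs = torch.empty(n, t_max, d)
    for i in range(n):
        w = random_orthogonal(d)
        s = torch.randn(d)
        s = s / s.norm()                 # start on the unit sphere (assumption)
        for t in range(t_max):
            seqs[i, t] = s
            s = w @ s                    # autoregressive step s_{t+1} = W s_t
    return seqs

train_seqs = make_dataset(n=2**14, t_max=50, d=5)
test_seqs = make_dataset(n=2**10, t_max=50, d=5)
```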
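
For the Experiment Setup row, a minimal training-loop sketch matching the quoted hyperparameters (Adam, learning rate 5 × 10^-3, 2000 epochs, MSE) is given below. The `model` argument, the full-batch next-token objective, and the function name `train_mse` are assumptions for illustration; the paper's actual Transformer architecture and batching are not specified here.

```python
# Hedged training-loop sketch with the quoted hyperparameters:
# Adam, learning rate 5e-3, 2000 epochs, MSE on next-token prediction.
# `model` is a placeholder for the Transformer trained in the paper.
import torch

def train_mse(model, train_seqs, n_epochs=2000, lr=5e-3):
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = torch.nn.MSELoss()
    inputs, targets = train_seqs[:, :-1], train_seqs[:, 1:]  # shift by one token
    for epoch in range(n_epochs):
        optimizer.zero_grad()
        pred = model(inputs)             # model output assumed to match target shape
        loss = loss_fn(pred, targets)
        loss.backward()
        optimizer.step()
    return model
```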