Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Transformers Learn In-Context by Gradient Descent

Authors: Johannes von Oswald, Eyvind Niklasson, Ettore Randazzo, João Sacramento, Alexander Mordvintsev, Andrey Zhmoginov, Max Vladymyrov

ICML 2023 | Venue PDF | LLM Run Details

| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | "We show empirically that when training self-attention-only Transformers on simple regression tasks either the models learned by GD and Transformers show great similarity or, remarkably, the weights found by optimization match the construction." |
| Researcher Affiliation | Collaboration | "¹Department of Computer Science, ETH Zürich, Zürich, Switzerland ²Google Research." |
| Pseudocode | No | The paper describes methods through prose and equations, but does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | "Main experiments can be reproduced with notebooks provided under the following link: https://github.com/google-research/self-organising-systems/tree/master/transformers_learn_icl_by_gd" |
| Open Datasets | No | "We focus on solvable tasks and, similarly to Garg et al. (2022), generate data for each task using a teacher model with parameters Wτ ∼ N(0, I). We then sample xτ,i ∼ U(-1, 1)^{n_I} and construct targets using the task-specific teacher model, yτ,i = Wτ xτ,i." |
| Dataset Splits | Yes | "More concretely, to compare trained and constructed LSA layers, we sample T_val = 10^4 validation tasks and record the following quantities, averaged over validation tasks." |
| Hardware Specification | No | The paper does not provide specific details regarding the hardware used for experiments, such as CPU/GPU models or memory specifications. |
| Software Dependencies | No | The paper mentions software like Adam, Optax, and Haiku, but does not provide specific version numbers for these dependencies, which are necessary for full reproducibility. |
| Experiment Setup | Yes | "Optimizer: Adam (Kingma & Ba, 2014) with default parameters and learning rate of 0.001 for Transformers with depth K < 3 and 0.0005 otherwise. We use a batch size of 2048 and applied gradient clipping to obtain gradients with global norm of 10." |
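The data-generation recipe quoted under "Open Datasets" (teacher weights drawn from N(0, I), inputs drawn uniformly from [-1, 1], targets produced by the teacher) can be sketched in a few lines. This is a minimal NumPy illustration, not the authors' code; the function name and the default dimensions are hypothetical choices for the example.

```python
import numpy as np

def sample_regression_task(n_in=10, n_points=20, seed=0):
    """Generate one synthetic linear-regression task:
    teacher weights W ~ N(0, I), inputs x ~ U(-1, 1)^{n_in},
    targets y = W x (dimensions here are illustrative)."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal(n_in)                      # teacher parameters
    X = rng.uniform(-1.0, 1.0, size=(n_points, n_in))  # in-context inputs
    y = X @ W                                          # teacher-generated targets
    return X, y
```

Sampling a fresh seed per task reproduces the per-task teacher setup described in the quote.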
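The "Experiment Setup" row mentions clipping gradients to a global norm of 10; the paper uses Optax for this. A minimal NumPy sketch of global-norm clipping, assuming gradients arrive as a list of arrays (the function name is hypothetical):

```python
import numpy as np

def clip_by_global_norm(grads, max_norm=10.0):
    """Rescale all gradient arrays jointly so their combined
    L2 norm does not exceed max_norm (no-op if already smaller)."""
    global_norm = np.sqrt(sum(float(np.sum(g ** 2)) for g in grads))
    scale = min(1.0, max_norm / max(global_norm, 1e-12))
    return [g * scale for g in grads]
```

In the actual training stack this corresponds to chaining a global-norm clipping transform before Adam in Optax.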