Transformers Learn In-Context by Gradient Descent

Authors: Johannes von Oswald, Eyvind Niklasson, Ettore Randazzo, João Sacramento, Alexander Mordvintsev, Andrey Zhmoginov, Max Vladymyrov

ICML 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "show empirically that when training self-attention-only Transformers on simple regression tasks either the models learned by GD and Transformers show great similarity or, remarkably, the weights found by optimization match the construction." (a minimal linear self-attention sketch follows the table)
Researcher Affiliation | Collaboration | "¹Department of Computer Science, ETH Zürich, Zürich, Switzerland; ²Google Research."
Pseudocode | No | The paper describes its methods through prose and equations but does not include any clearly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | "Main experiments can be reproduced with notebooks provided under the following link: https://github.com/google-research/self-organising-systems/tree/master/transformers_learn_icl_by_gd"
Open Datasets | No | "We focus on solvable tasks and similarly to Garg et al. (2022) generate data for each task using a teacher model with parameters Wτ ∼ N(0, I). We then sample xτ,i ∼ U(-1, 1)^{n_I} and construct targets using the task-specific teacher model, yτ,i = Wτ xτ,i." (a data-generation sketch follows the table)
Dataset Splits | Yes | "More concretely, to compare trained and constructed LSA layers, we sample T_val = 10^4 validation tasks and record the following quantities, averaged over validation tasks."
Hardware Specification | No | The paper does not provide specific details regarding the hardware used for the experiments, such as CPU/GPU models or memory specifications.
Software Dependencies | No | The paper mentions the Adam optimizer and the Optax and Haiku libraries, but does not provide version numbers for these dependencies, which are necessary for full reproducibility.
Experiment Setup | Yes | "Optimizer: Adam (Kingma & Ba, 2014) with default parameters and learning rate of 0.001 for Transformer with depth K < 3 and 0.0005 otherwise. We use a batchsize of 2048 and applied gradient clipping to obtain gradients with global norm of 10." (an Optax sketch of this setup follows the table)
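
The Research Type and Dataset Splits rows concern self-attention-only Transformers built from linear self-attention (LSA) layers, i.e. attention without the softmax. The following is a minimal JAX sketch of such an update; the parameter names (W_K, W_Q, W_V, P), the square shapes, the residual form, and the 1/N scaling are illustrative assumptions rather than the paper's exact construction.

```python
# Minimal sketch (not the authors' code): one softmax-free linear
# self-attention (LSA) update over a token matrix E of shape (N, d).
import jax

def lsa_layer(params, E):
    K = E @ params["W_K"].T   # keys,    (N, d)
    Q = E @ params["W_Q"].T   # queries, (N, d)
    V = E @ params["W_V"].T   # values,  (N, d)
    scores = Q @ K.T          # linear attention scores, no softmax, (N, N)
    return E + (scores @ V) @ params["P"].T / E.shape[0]   # residual token update

# Toy usage with random parameters (shapes are illustrative assumptions).
d, N = 11, 10
k1, k2, k3, k4, kE = jax.random.split(jax.random.PRNGKey(0), 5)
params = {"W_K": 0.1 * jax.random.normal(k1, (d, d)),
          "W_Q": 0.1 * jax.random.normal(k2, (d, d)),
          "W_V": 0.1 * jax.random.normal(k3, (d, d)),
          "P":   0.1 * jax.random.normal(k4, (d, d))}
E = jax.random.normal(kE, (N, d))
print(lsa_layer(params, E).shape)   # (10, 11)
```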
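
The Open Datasets row quotes the paper's task-sampling procedure: a teacher Wτ drawn from N(0, I), inputs drawn uniformly from (-1, 1), and targets yτ,i = Wτ xτ,i. Below is a minimal JAX sketch of that procedure under assumed dimensions (10-dimensional inputs, scalar targets, 10 in-context examples per task; these numbers are not taken from the quote).

```python
# Sketch of the quoted task-generation procedure; dimensions are assumptions.
import jax

def sample_task(key, n_in=10, n_out=1, n_examples=10):
    """Sample one linear-regression task: teacher W ~ N(0, I), inputs x ~ U(-1, 1)."""
    k_w, k_x = jax.random.split(key)
    W = jax.random.normal(k_w, (n_out, n_in))                  # teacher parameters W_tau
    X = jax.random.uniform(k_x, (n_examples, n_in), minval=-1.0, maxval=1.0)
    Y = X @ W.T                                                # targets y = W_tau x
    return X, Y

X, Y = sample_task(jax.random.PRNGKey(42))
print(X.shape, Y.shape)   # (10, 10) (10, 1)
```

Validation as described under Dataset Splits would then amount to drawing T_val = 10^4 such tasks (for example by vmapping sample_task over split keys) and averaging the recorded quantities over them.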
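
The Experiment Setup row maps naturally onto Optax, one of the libraries noted under Software Dependencies. The sketch below is an assumed composition of the quoted settings: Adam with default parameters, a learning rate of 0.001 for depth K < 3 and 0.0005 otherwise, and gradient clipping to a global norm of 10; the ordering of the transformations and the helper name make_optimizer are our choices. The batch size of 2048 belongs to the data pipeline and is not shown.

```python
# Assumed Optax composition of the quoted optimizer settings.
import optax

def make_optimizer(depth_K: int) -> optax.GradientTransformation:
    learning_rate = 1e-3 if depth_K < 3 else 5e-4   # 0.001 for K < 3, 0.0005 otherwise
    return optax.chain(
        optax.clip_by_global_norm(10.0),            # clip gradients to global norm 10
        optax.adam(learning_rate),                  # Adam with default hyperparameters
    )

optimizer = make_optimizer(depth_K=2)
# Typical use: opt_state = optimizer.init(params)
#              updates, opt_state = optimizer.update(grads, opt_state)
#              params = optax.apply_updates(params, updates)
```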