Understanding Addition in Transformers

Authors: Philip Quirke, Fazl Barez

ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | This paper provides a comprehensive analysis of a one-layer Transformer model trained to perform n-digit integer addition. Our findings suggest that the model dissects the task into parallel streams dedicated to individual digits, employing varied algorithms tailored to different positions within the digits. Furthermore, we identify a rare scenario characterized by high loss, which we explain. By thoroughly elucidating the model's algorithm, we provide new insights into its functioning. These findings are validated through rigorous testing and mathematical modeling, thereby contributing to the broader fields of model understanding and interpretability. Our results demonstrate the transformer's unique approach applies to integer addition across various digit lengths (see Appendixes B and C).
Researcher Affiliation | Collaboration | Philip Quirke (Apart Research); Fazl Barez (Apart Research, University of Oxford)
Pseudocode | Yes | Appendix G: Model Algorithm as Pseudocode
Open Source Code | Yes | To facilitate the reproduction of our empirical results on understanding and interpreting addition in one-layer transformers, and further studying the properties of more complex transformers on more complex tasks that would build on a single layer, we release all our code and resources used in this work.
Open Datasets | No | The paper mentions generating training data ('new batch of data each training step', '1.5 million training datums') but does not provide specific access information (link, citation, or repository) to this generated dataset to make it publicly available.
Dataset Splits | No | The paper mentions training data and test questions but does not explicitly provide information on dataset splits for training, validation, and testing with specific percentages or counts.
Hardware Specification | Yes | It runs on a T4 GPU with each experiment taking a few minutes to run.
Software Dependencies | No | The paper mentions 'A Colab notebook was used for experimentation' but does not specify any software dependencies with version numbers (e.g., Python, PyTorch, or other libraries).
Experiment Setup | Yes | The key parameters (which can all be altered) are: 1. n_layers = 1 (a one-layer Transformer); 2. n_heads = 3 (three attention heads); 3. n_digits = 5 (number of digits in the addition question). During a training run the model processes about 1.5 million training datums. To speed up training, the data generator was enhanced to increase the frequency of these cases in the training data.
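
For orientation, the stated setup maps to roughly the following configuration. This is a minimal sketch in Python: only n_layers, n_heads, n_digits, and the ~1.5 million training datums per run are quoted from the paper; the batch size (and therefore the derived step count) is a hypothetical placeholder, not a figure reported in the work.

```python
# Sketch of the experiment configuration described in the row above.
# Only n_layers, n_heads, n_digits, and the ~1.5M-datum training budget
# come from the paper; other values are hypothetical placeholders.
config = {
    "n_layers": 1,   # one-layer Transformer
    "n_heads": 3,    # three attention heads
    "n_digits": 5,   # digits per operand in the addition question
}

train_datums = 1_500_000              # "about 1.5 million training datums"
batch_size = 64                       # hypothetical, not stated in the paper
n_steps = train_datums // batch_size  # a fresh batch is generated each step

print(config, n_steps)
```

Since the training data is generated on the fly ('new batch of data each training step') rather than released as a dataset, the procedure can be read as a sketch like the one below. The token layout (zero-padded operands, a '+' and '=' separator, an (n+1)-digit answer) is an assumption for illustration, not the paper's exact encoding, and the paper's enhancement that oversamples rare high-loss cases is not reproduced here.

```python
import random

N_DIGITS = 5  # matches n_digits in the experiment setup above

def make_addition_question(n_digits=N_DIGITS):
    """Generate one n-digit addition question and answer as a string.

    Hypothetical layout: "DDDDD+DDDDD=DDDDDD", i.e. two zero-padded n-digit
    operands and an (n+1)-digit answer so a final carry still fits.
    """
    a = random.randrange(10 ** n_digits)
    b = random.randrange(10 ** n_digits)
    question = f"{a:0{n_digits}d}+{b:0{n_digits}d}="
    answer = f"{a + b:0{n_digits + 1}d}"
    return question + answer

def make_batch(batch_size=64, n_digits=N_DIGITS):
    """Generate a fresh batch of questions, one batch per training step."""
    return [make_addition_question(n_digits) for _ in range(batch_size)]

if __name__ == "__main__":
    for example in make_batch(batch_size=3):
        print(example)  # e.g. 04213+90008=094221
```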