Understanding Addition in Transformers

Authors: Philip Quirke, Fazl Barez

ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | This paper provides a comprehensive analysis of a one-layer Transformer model trained to perform n-digit integer addition. Our findings suggest that the model dissects the task into parallel streams dedicated to individual digits, employing varied algorithms tailored to different positions within the digits. Furthermore, we identify a rare scenario characterized by high loss, which we explain. By thoroughly elucidating the model's algorithm, we provide new insights into its functioning. These findings are validated through rigorous testing and mathematical modeling, thereby contributing to the broader fields of model understanding and interpretability. Our results demonstrate the transformer's unique approach applies to integer addition across various digit lengths (see Appendixes B and C).
Researcher Affiliation | Collaboration | Philip Quirke (Apart Research); Fazl Barez (Apart Research, University of Oxford)
Pseudocode | Yes | Appendix G: Model Algorithm as Pseudocode
Open Source Code | Yes | To facilitate the reproduction of our empirical results on understanding and interpreting addition in one-layer transformers, and further studying the properties of more complex transformers on more complex tasks that would build on a single layer, we release all our code and resources used in this work.
Open Datasets | No | The paper mentions generating training data ('new batch of data each training step', '1.5 million training datums') but does not provide specific access information (link, citation, or repository) to this generated dataset to make it publicly available.
Dataset Splits | No | The paper mentions training data and test questions but does not explicitly provide information on dataset splits for training, validation, and testing with specific percentages or counts.
Hardware Specification | Yes | It runs on a T4 GPU with each experiment taking a few minutes to run.
Software Dependencies | No | The paper mentions 'A Colab notebook was used for experimentation' but does not specify any software dependencies with version numbers (e.g., Python, PyTorch, or other libraries).
Experiment Setup | Yes | The key parameters (which can all be altered) are: 1. n_layers = 1 (a one-layer Transformer); 2. n_heads = 3 (three attention heads); 3. n_digits = 5 (number of digits in the addition question). During a training run the model processes about 1.5 million training datums. To speed up training, the data generator was enhanced to increase the frequency of these cases in the training data.
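
For orientation, the stated setup maps to roughly the following configuration. This is a minimal sketch in Python: only n_layers, n_heads, n_digits, and the ~1.5 million training datums per run are quoted from the paper; the batch size (and therefore the derived step count) is a hypothetical placeholder, not a figure reported in the work.

```python
# Sketch of the experiment configuration described in the row above.
# Only n_layers, n_heads, n_digits, and the ~1.5M-datum training budget
# come from the paper; other values are hypothetical placeholders.
config = {
    "n_layers": 1,   # one-layer Transformer
    "n_heads": 3,    # three attention heads
    "n_digits": 5,   # digits per operand in the addition question
}

train_datums = 1_500_000              # "about 1.5 million training datums"
batch_size = 64                       # hypothetical, not stated in the paper
n_steps = train_datums // batch_size  # a fresh batch is generated each step

print(config, n_steps)
```

Since the training data is generated on the fly ('new batch of data each training step') rather than released as a dataset, the procedure can be read as a sketch like the one below. The token layout (zero-padded operands, a '+' and '=' separator, an (n+1)-digit answer) is an assumption for illustration, not the paper's exact encoding, and the paper's enhancement that oversamples rare high-loss cases is not reproduced here.

```python
import random

N_DIGITS = 5  # matches n_digits in the experiment setup above

def make_addition_question(n_digits=N_DIGITS):
    """Generate one n-digit addition question and answer as a string.

    Hypothetical layout: "DDDDD+DDDDD=DDDDDD", i.e. two zero-padded n-digit
    operands and an (n+1)-digit answer so a final carry still fits.
    """
    a = random.randrange(10 ** n_digits)
    b = random.randrange(10 ** n_digits)
    question = f"{a:0{n_digits}d}+{b:0{n_digits}d}="
    answer = f"{a + b:0{n_digits + 1}d}"
    return question + answer

def make_batch(batch_size=64, n_digits=N_DIGITS):
    """Generate a fresh batch of questions, one batch per training step."""
    return [make_addition_question(n_digits) for _ in range(batch_size)]

if __name__ == "__main__":
    for example in make_batch(batch_size=3):
        print(example)  # e.g. 04213+90008=094221
```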