Transformers Can Do Arithmetic with the Right Embeddings

Authors: Sean McLeish, Arpit Bansal, Alex Stein, Neel Jain, John Kirchenbauer, Brian Bartoldson, Bhavya Kailkhura, Abhinav Bhatele, Jonas Geiping, Avi Schwarzschild, Tom Goldstein

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We find that training on only 20 digit numbers with a single GPU for one day, we can reach state-of-the-art performance, achieving up to 99% accuracy on 100 digit addition problems.
Researcher Affiliation | Collaboration | 1 University of Maryland, 2 Lawrence Livermore National Laboratory, 3 ELLIS Institute Tübingen, Max Planck Institute for Intelligent Systems, Tübingen AI Center, 4 Carnegie Mellon University
Pseudocode | No | The paper describes its methods but does not include structured pseudocode or algorithm blocks.
Open Source Code | Yes | Code is available on GitHub: github.com/mcleish7/arithmetic.
Open Datasets | Yes | We will release all code and datasets on GitHub with an MIT License.
Dataset Splits | No | The paper describes training and testing categories (in-distribution, out-of-distribution) but does not explicitly mention a separate validation split.
Hardware Specification | Yes | a single Nvidia RTX A4000 GPU for 24 hours; Our testing pipeline for addition and Bitwise OR uses Nvidia V100 GPUs.
Software Dependencies | No | The paper mentions software components such as PyTorch and AdamW but does not provide specific version numbers for them or other key software dependencies.
Experiment Setup | Yes | We detail what we believe to be an important subset of the default hyperparameter values in Table 5. A full list of all hyperparameters and model configurations is contained in the code release. For multiplication models with FIRE embeddings, the learning rate is 0.00006, due to large instabilities in higher learning rates which were not experienced for the Abacus Embeddings. (Table 5 lists: Hidden Size 1024, Intermediate Size 2048, Embedding Size 1024, Number of Attention Heads 16, Learning Rate 0.0001, etc.)
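To make the quoted Table 5 values concrete, below is a minimal configuration sketch in Python. The `ModelConfig` class and its field names are illustrative assumptions, not the authors' actual configuration code; the authoritative hyperparameter files are in the code release at github.com/mcleish7/arithmetic.

```python
# Hypothetical sketch of the default hyperparameters quoted above from Table 5.
# Field names are assumptions for illustration, not the authors' config schema.
from dataclasses import dataclass


@dataclass
class ModelConfig:
    hidden_size: int = 1024          # Table 5: Hidden Size
    intermediate_size: int = 2048    # Table 5: Intermediate Size
    embedding_size: int = 1024       # Table 5: Embedding Size
    num_attention_heads: int = 16    # Table 5: Number of Attention Heads
    learning_rate: float = 1e-4      # Table 5: default Learning Rate


# Per the quoted setup, multiplication models with FIRE embeddings use a lower
# learning rate (0.00006) to avoid instabilities seen at higher rates; models
# with Abacus Embeddings reportedly did not need this adjustment.
fire_multiplication_config = ModelConfig(learning_rate=6e-5)
```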