Transformers Can Do Arithmetic with the Right Embeddings
Authors: Sean McLeish, Arpit Bansal, Alex Stein, Neel Jain, John Kirchenbauer, Brian Bartoldson, Bhavya Kailkhura, Abhinav Bhatele, Jonas Geiping, Avi Schwarzschild, Tom Goldstein
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We find that, training on only 20-digit numbers with a single GPU for one day, we can reach state-of-the-art performance, achieving up to 99% accuracy on 100-digit addition problems. |
| Researcher Affiliation | Collaboration | University of Maryland; Lawrence Livermore National Laboratory; ELLIS Institute Tübingen, Max Planck Institute for Intelligent Systems, Tübingen AI Center; Carnegie Mellon University |
| Pseudocode | No | The paper describes methods but does not include structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code available on GitHub: github.com/mcleish7/arithmetic. |
| Open Datasets | Yes | We will release all code and datasets on GitHub with an MIT License. |
| Dataset Splits | No | The paper describes training and testing categories (in-distribution, out-of-distribution) but does not explicitly mention a separate validation dataset split. |
| Hardware Specification | Yes | a single Nvidia RTX A4000 GPU for 24 hours; our testing pipeline for addition and bitwise OR uses Nvidia V100 GPUs. |
| Software Dependencies | No | The paper mentions software components like PyTorch and AdamW but does not provide specific version numbers for them or other key software dependencies. |
| Experiment Setup | Yes | We detail what we believe to be an important subset of the default hyperparameter values in Table 5; a full list of all hyperparameters and model configurations is contained in the code release. For multiplication models with FIRE embeddings, the learning rate is 0.00006, due to large instabilities at higher learning rates that were not experienced with the Abacus Embeddings. (Table 5 lists: Hidden Size 1024, Intermediate Size 2048, Embedding Size 1024, Number of Attention Heads 16, Learning Rate 0.0001, etc.) |
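
To make the Table 5 defaults concrete, here is a minimal sketch of how they could be collected into a configuration object. This is an illustrative assumption, not the repository's actual config schema: the names `TrainingConfig`, `hidden_size`, and so on are hypothetical, and only the numeric values come from the paper.

```python
# Hypothetical configuration sketch; field names are illustrative and are
# not taken from github.com/mcleish7/arithmetic. Numeric values are the
# Table 5 defaults quoted above.
from dataclasses import dataclass, replace

@dataclass
class TrainingConfig:
    hidden_size: int = 1024        # Table 5: Hidden Size
    intermediate_size: int = 2048  # Table 5: Intermediate Size
    embedding_size: int = 1024     # Table 5: Embedding Size
    num_attention_heads: int = 16  # Table 5: Number of Attention Heads
    learning_rate: float = 1e-4    # Table 5: Learning Rate (0.0001)

# Per the paper, multiplication models with FIRE embeddings use a lower
# learning rate (0.00006) to avoid the instabilities seen at higher rates.
fire_multiplication_config = replace(TrainingConfig(), learning_rate=6e-5)
```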