Teaching Arithmetic to Small Transformers

Authors: Nayoung Lee, Kartik Sreenivasan, Jason D. Lee, Kangwook Lee, Dimitris Papailiopoulos

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "This study investigates how even small transformers, trained from random initialization, can efficiently learn arithmetic operations such as addition, multiplication, and elementary functions like square root, using the next-token prediction objective. We first demonstrate that conventional training data is not the most effective for arithmetic learning, and simple formatting changes can significantly improve accuracy." (see the formatting sketch below the table)
Researcher Affiliation | Academia | Nayoung Lee, University of Wisconsin-Madison, nayoung.lee@wisc.edu; Kartik Sreenivasan, University of Wisconsin-Madison, ksreenivasa2@wisc.edu; Jason D. Lee, Princeton University, jasonlee@princeton.edu; Kangwook Lee, University of Wisconsin-Madison, kangwook.lee@wisc.edu; Dimitris Papailiopoulos, University of Wisconsin-Madison, dimitris@papail.io
Pseudocode | Yes | "We present the full pseudo-code in Algorithm 1."
Open Source Code | Yes | "Our code is available at https://github.com/lee-ny/teaching_arithmetic"
Open Datasets | Yes | "For arithmetic tasks like addition, subtraction, and multiplication, we define the training dataset for a binary operator $f(\cdot)$ as $\mathcal{D}_{\text{train}} = \{((a_i, b_i), y_i)\}_{i=1}^{N}$ where $y_i = f(a_i, b_i)$. ... We use the Shakespeare dataset (Karpathy, 2015) that includes 1,115,394 tokens of text..."
Dataset Splits | Yes | "The learning rate is chosen from {1e-3, 5e-4, 1e-4, 5e-5} based on validation loss." (see the selection sketch below the table)
Hardware Specification | Yes | "All of our experiments on NanoGPT and GPT-2 models are run using PyTorch 2.1 and CUDA 11.7 on NVIDIA 2080 Tis and NVIDIA 3090s."
Software Dependencies | Yes | "All of our experiments on NanoGPT and GPT-2 models are run using PyTorch 2.1 and CUDA 11.7 on NVIDIA 2080 Tis and NVIDIA 3090s."
Experiment Setup | Yes | "In this section, we provide a detailed description of our experimental setup, including the model architecture and an overview of the various data formatting and sampling techniques used. ... Table 16: Hyperparameters used for NanoGPT experiments on arithmetic tasks ... Table 17: Hyperparameters used for GPT-2 experiments on arithmetic tasks"
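
The Research Type and Open Datasets rows above quote the paper's dataset definition $\mathcal{D}_{\text{train}} = \{((a_i, b_i), y_i)\}_{i=1}^{N}$ and its claim that simple formatting changes can significantly improve accuracy. As a minimal sketch of what such addition data might look like, the snippet below generates samples in a plain format and in a reversed-output format (one of the formatting changes the paper studies); the function name, the exact string format, and other details here are illustrative assumptions, not the authors' actual data pipeline.

import random


def make_addition_samples(n_samples, n_digits=3, reverse=False, seed=0):
    """Generate next-token-prediction lines for n_digits-digit addition.

    Plain format:    "a+b=c"
    Reversed format: "a+b=rev(c)" -- the answer is written least-significant
    digit first (illustrative convention, assumed here).
    """
    rng = random.Random(seed)
    lines = []
    for _ in range(n_samples):
        a = rng.randrange(10 ** n_digits)   # a_i
        b = rng.randrange(10 ** n_digits)   # b_i
        y = a + b                           # y_i = f(a_i, b_i)
        ans = str(y)[::-1] if reverse else str(y)
        lines.append(f"{a}+{b}={ans}")
    return lines


print(make_addition_samples(3))                 # plain-format lines
print(make_addition_samples(3, reverse=True))   # reversed-output lines

Writing the answer least-significant digit first matches the order in which carries are resolved, which is the intuition the paper offers for why reversed formatting is easier to learn under next-token prediction.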
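The Dataset Splits row records that the learning rate is chosen from {1e-3, 5e-4, 1e-4, 5e-5} based on validation loss. Below is a generic sketch of that kind of grid selection; `train_and_evaluate` is a hypothetical stand-in for the paper's actual training loop, which lives in the linked repository.

LEARNING_RATES = [1e-3, 5e-4, 1e-4, 5e-5]  # grid reported in the paper


def pick_learning_rate(train_and_evaluate):
    """Train once per candidate learning rate and keep the one with the
    lowest validation loss. `train_and_evaluate(lr) -> float` is assumed
    to train a model with learning rate `lr` and return its val loss."""
    best_lr, best_val = None, float("inf")
    for lr in LEARNING_RATES:
        val_loss = train_and_evaluate(lr)
        if val_loss < best_val:
            best_lr, best_val = lr, val_loss
    return best_lr


# Example with a stand-in objective (hypothetical; real code would train a model):
print(pick_learning_rate(lambda lr: abs(lr - 5e-4)))  # -> 0.0005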