Teaching Arithmetic to Small Transformers
Authors: Nayoung Lee, Kartik Sreenivasan, Jason D. Lee, Kangwook Lee, Dimitris Papailiopoulos
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | This study investigates how even small transformers, trained from random initialization, can efficiently learn arithmetic operations such as addition, multiplication, and elementary functions like square root, using the next-token prediction objective. We first demonstrate that conventional training data is not the most effective for arithmetic learning, and simple formatting changes can significantly improve accuracy. |
| Researcher Affiliation | Academia | Nayoung Lee University of Wisconsin-Madison nayoung.lee@wisc.edu Kartik Sreenivasan University of Wisconsin-Madison ksreenivasa2@wisc.edu Jason D. Lee Princeton University jasonlee@princeton.edu Kangwook Lee University of Wisconsin-Madison kangwook.lee@wisc.edu Dimitris Papailiopoulos University of Wisconsin-Madison dimitris@papail.io |
| Pseudocode | Yes | We present the full pseudo-code in Algorithm 1. |
| Open Source Code | Yes | Our code is available at https://github.com/lee-ny/teaching_arithmetic |
| Open Datasets | Yes | For arithmetic tasks like addition, subtraction, and multiplication, we define the training dataset for a binary operator $f(\cdot)$ as $D_{\text{train}} = \{((a_i, b_i), y_i)\}_{i=1}^{N}$ where $y_i = f(a_i, b_i)$. ... We use the Shakespeare dataset (Karpathy, 2015) that includes 1,115,394 tokens of text... (see the data-generation sketch after the table) |
| Dataset Splits | Yes | The learning rate is chosen from {1e-3, 5e-4, 1e-4, 5e-5} based on validation loss. (see the selection sketch after the table) |
| Hardware Specification | Yes | All of our experiments on NanoGPT and GPT-2 models are run using PyTorch 2.1 and CUDA 11.7 on NVIDIA 2080 Tis and NVIDIA 3090s. |
| Software Dependencies | Yes | All of our experiments on NanoGPT and GPT-2 models are run using PyTorch 2.1 and CUDA 11.7 on NVIDIA 2080 Tis and NVIDIA 3090s. |
| Experiment Setup | Yes | In this section, we provide a detailed description of our experimental setup, including the model architecture and an overview of the various data formatting and sampling techniques used. ... Table 16: Hyperparameters used for NanoGPT experiments on arithmetic tasks ... Table 17: Hyperparameters used for GPT-2 experiments on arithmetic tasks |
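
The Open Datasets row defines the training set $D_{\text{train}} = \{((a_i, b_i), y_i)\}_{i=1}^{N}$, and the Research Type row notes that simple formatting changes can significantly improve accuracy (the paper's headline example is emitting the answer digits in reverse). The sketch below is a minimal, hypothetical illustration of both ideas for addition; `make_addition_sample` and `make_dataset` are illustrative names, and the exact string formats are assumptions, not the paper's literal prompts.

```python
import random

def make_addition_sample(a: int, b: int, reverse_output: bool = False) -> str:
    """Format one addition example as a next-token-prediction string.

    Plain format:    "a+b=c"
    Reversed format: "a+b=" followed by the digits of c least-significant
    first. Reversal is the kind of formatting change the paper reports as
    helpful; the concrete format here is illustrative only.
    """
    c = str(a + b)
    if reverse_output:
        c = c[::-1]
    return f"{a}+{b}={c}"

def make_dataset(n: int, max_digits: int = 3, reverse_output: bool = False) -> list[str]:
    """Sample n operand pairs uniformly and format them as training strings,
    mirroring D_train = {((a_i, b_i), y_i)}_{i=1}^N with y_i = a_i + b_i."""
    hi = 10 ** max_digits - 1
    return [
        make_addition_sample(random.randint(0, hi), random.randint(0, hi), reverse_output)
        for _ in range(n)
    ]

if __name__ == "__main__":
    # Produces strings like "12+345=357" (plain) or "12+345=753" (reversed).
    print(make_dataset(3, reverse_output=False))
    print(make_dataset(3, reverse_output=True))
```

One plausible reading of why reversal helps: generating the sum least-significant digit first aligns the autoregressive generation order with carry propagation, so each output token depends only on digits already available.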
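The Dataset Splits row quotes a learning-rate grid selected on validation loss. A minimal sketch of that selection loop, assuming a caller-supplied `train_and_evaluate` routine (a stand-in, not a function from the authors' repository):

```python
def select_learning_rate(train_and_evaluate, grid=(1e-3, 5e-4, 1e-4, 5e-5)):
    """Train once per candidate learning rate and return the one with the
    lowest validation loss, as the quoted protocol describes."""
    scores = {lr: train_and_evaluate(lr) for lr in grid}
    return min(scores, key=scores.get)
```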