Teaching Arithmetic to Small Transformers

Authors: Nayoung Lee, Kartik Sreenivasan, Jason D. Lee, Kangwook Lee, Dimitris Papailiopoulos

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "This study investigates how even small transformers, trained from random initialization, can efficiently learn arithmetic operations such as addition, multiplication, and elementary functions like square root, using the next-token prediction objective. We first demonstrate that conventional training data is not the most effective for arithmetic learning, and simple formatting changes can significantly improve accuracy." (see the formatting sketch below the table)
Researcher Affiliation | Academia | Nayoung Lee, University of Wisconsin-Madison, nayoung.lee@wisc.edu; Kartik Sreenivasan, University of Wisconsin-Madison, ksreenivasa2@wisc.edu; Jason D. Lee, Princeton University, jasonlee@princeton.edu; Kangwook Lee, University of Wisconsin-Madison, kangwook.lee@wisc.edu; Dimitris Papailiopoulos, University of Wisconsin-Madison, dimitris@papail.io
Pseudocode | Yes | "We present the full pseudo-code in Algorithm 1."
Open Source Code | Yes | "Our code is available at https://github.com/lee-ny/teaching_arithmetic"
Open Datasets | Yes | "For arithmetic tasks like addition, subtraction, and multiplication, we define the training dataset for a binary operator $f(\cdot)$ as $\mathcal{D}_{\text{train}} = \{((a_i, b_i), y_i)\}_{i=1}^{N}$ where $y_i = f(a_i, b_i)$. ... We use the Shakespeare dataset (Karpathy, 2015) that includes 1,115,394 tokens of text..."
Dataset Splits | Yes | "The learning rate is chosen from {1e-3, 5e-4, 1e-4, 5e-5} based on validation loss." (see the selection sketch below the table)
Hardware Specification | Yes | "All of our experiments on NanoGPT and GPT-2 models are run using PyTorch 2.1 and CUDA 11.7 on NVIDIA 2080 Tis and NVIDIA 3090s."
Software Dependencies | Yes | "All of our experiments on NanoGPT and GPT-2 models are run using PyTorch 2.1 and CUDA 11.7 on NVIDIA 2080 Tis and NVIDIA 3090s."
Experiment Setup | Yes | "In this section, we provide a detailed description of our experimental setup, including the model architecture and an overview of the various data formatting and sampling techniques used. ... Table 16: Hyperparameters used for NanoGPT experiments on arithmetic tasks ... Table 17: Hyperparameters used for GPT-2 experiments on arithmetic tasks"
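
The Research Type and Open Datasets rows above quote the paper's dataset definition $\mathcal{D}_{\text{train}} = \{((a_i, b_i), y_i)\}_{i=1}^{N}$ and its claim that simple formatting changes can significantly improve accuracy. As a minimal sketch of what such addition data might look like, the snippet below generates samples in a plain format and in a reversed-output format (one of the formatting changes the paper studies); the function name, the exact string format, and other details here are illustrative assumptions, not the authors' actual data pipeline.

import random


def make_addition_samples(n_samples, n_digits=3, reverse=False, seed=0):
    """Generate next-token-prediction lines for n_digits-digit addition.

    Plain format:    "a+b=c"
    Reversed format: "a+b=rev(c)" -- the answer is written least-significant
    digit first (illustrative convention, assumed here).
    """
    rng = random.Random(seed)
    lines = []
    for _ in range(n_samples):
        a = rng.randrange(10 ** n_digits)   # a_i
        b = rng.randrange(10 ** n_digits)   # b_i
        y = a + b                           # y_i = f(a_i, b_i)
        ans = str(y)[::-1] if reverse else str(y)
        lines.append(f"{a}+{b}={ans}")
    return lines


print(make_addition_samples(3))                 # plain-format lines
print(make_addition_samples(3, reverse=True))   # reversed-output lines

Writing the answer least-significant digit first matches the order in which carries are resolved, which is the intuition the paper offers for why reversed formatting is easier to learn under next-token prediction.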
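The Dataset Splits row records that the learning rate is chosen from {1e-3, 5e-4, 1e-4, 5e-5} based on validation loss. Below is a generic sketch of that kind of grid selection; `train_and_evaluate` is a hypothetical stand-in for the paper's actual training loop, which lives in the linked repository.

LEARNING_RATES = [1e-3, 5e-4, 1e-4, 5e-5]  # grid reported in the paper


def pick_learning_rate(train_and_evaluate):
    """Train once per candidate learning rate and keep the one with the
    lowest validation loss. `train_and_evaluate(lr) -> float` is assumed
    to train a model with learning rate `lr` and return its val loss."""
    best_lr, best_val = None, float("inf")
    for lr in LEARNING_RATES:
        val_loss = train_and_evaluate(lr)
        if val_loss < best_val:
            best_lr, best_val = lr, val_loss
    return best_lr


# Example with a stand-in objective (hypothetical; real code would train a model):
print(pick_learning_rate(lambda lr: abs(lr - 5e-4)))  # -> 0.0005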