Pre-trained Large Language Models Use Fourier Features to Compute Addition

Authors: Tianyi Zhou, Deqing Fu, Vatsal Sharan, Robin Jia

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Unless otherwise stated, all experiments focus on the pre-trained GPT-2-XL model that has been fine-tuned on our addition dataset.
Researcher Affiliation | Academia | Department of Computer Science, University of Southern California, Los Angeles, CA 90089; {tzhou029,deqingfu,vsharan,robinjia}@usc.edu
Pseudocode | No | No pseudocode or algorithm block is provided. The paper includes formal definitions and mathematical descriptions of concepts such as the Fourier basis and the DFT, but these are not structured as pseudocode or algorithms. (A DFT sketch follows the table.)
Open Source Code | No | The paper states: "The goal of this paper is to understand how LLMs compute addition. We believe the code is not central to our contribution." It lists the existing open-source models used (GPT-2, GPT-J, Phi-2) but does not provide the authors' own implementation of the described methodology.
Open Datasets | No | The paper states: "We constructed a synthetic addition dataset for fine-tuning and evaluation purposes." The dataset construction and splits (80% training, 10% validation, 10% test) are described, but no link, DOI, or repository is provided for public access to this constructed dataset.
Dataset Splits | Yes | The dataset is shuffled and then split into training (80%), validation (10%), and test (10%) sets. (A split sketch follows the table.)
Hardware Specification | Yes | All experiments involving fine-tuning and training from scratch in this paper were conducted on one NVIDIA A6000 GPU with 48 GB of video memory.
Software Dependencies | No | The paper mentions using Hugging Face for model checkpoints (e.g., GPT-2-XL, GPT-J, Phi-2) but does not provide version numbers for software dependencies such as Python, PyTorch, or other libraries used to run the experiments.
Experiment Setup | Yes | We fine-tune GPT-2-XL on the language-math-dataset for 50 epochs with a batch size of 16. The dataset consists of 27,400 training samples, 3,420 validation samples, and 3,420 test samples. We use the AdamW optimizer, scheduling the learning rate linearly from 1e-5 to 0 without warmup. (A fine-tuning sketch follows the table.)
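
The paper's central claim concerns Fourier structure in how the model represents numbers, and the Pseudocode row notes that the Fourier-basis and DFT definitions are given mathematically rather than as pseudocode. The sketch below is an illustration only, not the authors' code: it applies a DFT to the token embeddings of small integers, using the base `gpt2` checkpoint as a lightweight stand-in for GPT-2-XL; the 0..99 range and the single-token filter are my own simplifications.

```python
import numpy as np
from transformers import GPT2Tokenizer, GPT2LMHeadModel

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
emb = model.get_input_embeddings().weight.detach().numpy()  # (vocab_size, d_model)

# Collect embeddings for the integers 0..99, skipping any number that the BPE
# does not encode as a single token (a rough approximation for this sketch).
rows = []
for t in range(100):
    ids = tok.encode(str(t))
    if len(ids) == 1:
        rows.append(emb[ids[0]])
E = np.stack(rows)  # rows ordered by the integer t

# DFT along the "number" axis: each embedding dimension is treated as a
# function of t. Peaks at particular frequencies (e.g. periods 10 or 2)
# would indicate Fourier-like structure in the number embeddings.
spectrum = np.abs(np.fft.rfft(E - E.mean(axis=0), axis=0))
top = spectrum.mean(axis=1).argsort()[::-1][:5]
print("dominant DFT frequency indices (averaged over dimensions):", top)
```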
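
For the Dataset Splits row, a minimal sketch of the shuffled 80/10/10 split is below. The `a+b=c` text format and the operand range are assumptions made for illustration; only the split ratios come from the paper.

```python
import random

random.seed(0)
MAX_OPERAND = 100  # hypothetical operand range; the paper defines its own

# Hypothetical "a+b=c" formatting; the paper's exact prompt template may differ.
examples = [f"{a}+{b}={a + b}" for a in range(MAX_OPERAND) for b in range(MAX_OPERAND)]
random.shuffle(examples)

n = len(examples)
n_train, n_val = int(0.8 * n), int(0.1 * n)
train = examples[:n_train]
val = examples[n_train:n_train + n_val]
test = examples[n_train + n_val:]
print(len(train), len(val), len(test))  # roughly 80% / 10% / 10%
```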
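
For the Experiment Setup row, a minimal fine-tuning sketch is below, assuming a Hugging Face `Trainer` workflow (the paper releases no code, so this is not the authors' implementation). It wires up the reported hyperparameters: GPT-2-XL, 50 epochs, batch size 16, AdamW, and a learning rate scheduled linearly from 1e-5 to 0 with no warmup; the toy in-memory dataset stands in for the real 27,400-example training split.

```python
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("gpt2-xl")
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained("gpt2-xl")

# Toy stand-in for the addition dataset; replace with the real training split.
texts = [f"{a}+{b}={a + b}" for a in range(10) for b in range(10)]
enc = tokenizer(texts, padding=True)
train_ds = [{"input_ids": i, "attention_mask": m}
            for i, m in zip(enc["input_ids"], enc["attention_mask"])]

args = TrainingArguments(
    output_dir="gpt2xl-addition",
    num_train_epochs=50,              # 50 epochs, as reported
    per_device_train_batch_size=16,   # batch size 16
    learning_rate=1e-5,               # initial learning rate
    lr_scheduler_type="linear",       # linear decay to 0
    warmup_steps=0,                   # no warmup
    optim="adamw_torch",              # AdamW optimizer
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_ds,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```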