Pre-trained Large Language Models Use Fourier Features to Compute Addition
Authors: Tianyi Zhou, Deqing Fu, Vatsal Sharan, Robin Jia
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Unless otherwise stated, all experiments focus on the pre-trained GPT-2-XL model that has been fine-tuned on our addition dataset. |
| Researcher Affiliation | Academia | Department of Computer Science, University of Southern California, Los Angeles, CA 90089; {tzhou029,deqingfu,vsharan,robinjia}@usc.edu |
| Pseudocode | No | No pseudocode or algorithm block is provided. The paper gives formal definitions and mathematical descriptions of concepts such as the Fourier basis and the DFT, but these are not structured as pseudocode or algorithms. (An illustrative DFT snippet follows the table.) |
| Open Source Code | No | The paper states: "The goal of this paper is to understand how LLMs compute addition. We believe the code is not central to our contribution." and lists existing open-source models (GPT-2, GPT-J, Phi2) that they used, but does not provide their own implementation code for the described methodology. |
| Open Datasets | No | The paper states: "We constructed a synthetic addition dataset for fine-tuning and evaluation purposes." While the details of the dataset construction and splits (training 80%, validation 10%, test 10%) are provided, there is no link, DOI, or repository provided for public access to this constructed dataset. |
| Dataset Splits | Yes | The dataset is shuffled and then split into training (80%), validation (10%), and test (10%) sets. (A split sketch follows the table.) |
| Hardware Specification | Yes | All experiments involving fine-tuning and training from scratch in this paper were conducted on one NVIDIA A6000 GPU with 48GB of video memory. |
| Software Dependencies | No | The paper mentions using Huggingface for model checkpoints (e.g., GPT-2-XL, GPT-J, Phi2) but does not provide specific version numbers for general software dependencies like Python, PyTorch, or other libraries used for running the experiments. |
| Experiment Setup | Yes | We finetune GPT-2-XL on the language-math-dataset with 50 epochs and a batch size of 16. The dataset consists of 27,400 training samples, 3,420 validation samples, and 3,420 test samples. We use the AdamW optimizer, scheduling the learning rate linearly from 1 × 10⁻⁵ to 0 without warmup. (A fine-tuning sketch follows the table.) |
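
The Pseudocode row above points out that the paper defines the Fourier basis and the discrete Fourier transform (DFT) mathematically rather than as algorithms. Purely as an illustration of what a DFT over the number line looks like, and not the authors' analysis code, the sketch below applies `numpy.fft` to a synthetic signal indexed by the numbers 0–99 and recovers its dominant period:

```python
import numpy as np

# Illustrative only: f[n] is some scalar associated with the number n
# (e.g. a logit or one embedding dimension); here it is synthetic data
# with a strong period-10 component plus noise.
N = 100
n = np.arange(N)
f = np.cos(2 * np.pi * n / 10) + 0.1 * np.random.default_rng(0).normal(size=N)

# Discrete Fourier transform over the number line 0..N-1.
F = np.fft.fft(f)
freqs = np.fft.fftfreq(N)               # cycles per unit step in n
magnitudes = np.abs(F)[: N // 2]        # keep non-negative frequencies

# The dominant frequency should correspond to period ~10 (frequency 0.1).
k = int(np.argmax(magnitudes[1:]) + 1)  # skip the DC component at k = 0
print(f"dominant frequency {freqs[k]:.2f}  (period ≈ {1 / freqs[k]:.1f})")
```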
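
The Open Datasets and Dataset Splits rows describe a synthetic addition dataset that is shuffled and split 80% / 10% / 10% but not released. The sketch below reproduces only that splitting logic; `make_addition_examples`, the prompt format, and `max_operand` are hypothetical placeholders, since the paper's generation code is not public.

```python
import random

def make_addition_examples(n_examples, max_operand=400, seed=0):
    """Hypothetical generator: each example pairs a prompt 'a+b=' with its answer.
    The operand range is a placeholder, not taken from the paper."""
    rng = random.Random(seed)
    examples = []
    for _ in range(n_examples):
        a, b = rng.randint(0, max_operand), rng.randint(0, max_operand)
        examples.append({"prompt": f"{a}+{b}=", "answer": str(a + b)})
    return examples

def split_dataset(examples, seed=0):
    """Shuffle, then split into 80% train / 10% validation / 10% test."""
    rng = random.Random(seed)
    examples = examples[:]
    rng.shuffle(examples)
    n = len(examples)
    n_train, n_val = int(0.8 * n), int(0.1 * n)
    return (examples[:n_train],
            examples[n_train:n_train + n_val],
            examples[n_train + n_val:])

# The paper reports 27,400 / 3,420 / 3,420 samples (34,240 total); the exact
# sizes here depend on rounding, so the authors' split code may differ slightly.
train, val, test = split_dataset(make_addition_examples(34_240))
print(len(train), len(val), len(test))
```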
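
The Experiment Setup row quotes the fine-tuning configuration: GPT-2-XL, 50 epochs, batch size 16, AdamW, and a learning rate decayed linearly from 1 × 10⁻⁵ to 0 with no warmup. Below is a minimal sketch of that configuration using the Hugging Face Transformers `Trainer` (whose defaults already give AdamW with a linear schedule), reusing the `train` / `val` lists from the split sketch above; the tokenization details and `output_dir` are assumptions, not the authors' setup.

```python
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

# Pre-trained GPT-2-XL checkpoint from the Hugging Face hub, as used in the paper.
tokenizer = AutoTokenizer.from_pretrained("gpt2-xl")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 ships without a pad token
model = AutoModelForCausalLM.from_pretrained("gpt2-xl")

# Tokenize the (assumed) prompt/answer format from the split sketch above.
def tokenize(batch):
    texts = [p + a for p, a in zip(batch["prompt"], batch["answer"])]
    return tokenizer(texts, truncation=True, max_length=32)

train_ds = Dataset.from_list(train).map(tokenize, batched=True,
                                        remove_columns=["prompt", "answer"])
val_ds = Dataset.from_list(val).map(tokenize, batched=True,
                                    remove_columns=["prompt", "answer"])

# Hyperparameters quoted in the Experiment Setup row: 50 epochs, batch size 16,
# AdamW, learning rate decayed linearly from 1e-5 to 0, no warmup.
args = TrainingArguments(
    output_dir="gpt2-xl-addition",       # placeholder path
    num_train_epochs=50,
    per_device_train_batch_size=16,
    learning_rate=1e-5,
    lr_scheduler_type="linear",          # linear decay to 0
    warmup_steps=0,                      # no warmup
    optim="adamw_torch",                 # AdamW optimizer
    report_to="none",
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_ds,
    eval_dataset=val_ds,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```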