OccamLLM: Fast and Exact Language Model Arithmetic in a Single Step

Authors: Owen Dugan, Donato M. Jiménez-Benetó, Charlotte Loh, Zhuo Chen, Rumen Dangovski, Marin Soljačić

NeurIPS 2024

Each entry below lists a reproducibility variable, the assessed result, and the supporting LLM response excerpted from the paper.
Research Type: Experimental. Our implementation using Llama 3 with OccamNet as a symbolic model (OccamLlama) achieves 100% accuracy on single arithmetic operations (+, −, ×, ÷, sin, cos, log, exp, √), outperforming GPT 4o with and without a code interpreter. Furthermore, OccamLlama outperforms GPT 4o with and without a code interpreter on average across a range of mathematical problem solving benchmarks, demonstrating that OccamLLMs can excel in arithmetic tasks, even surpassing much larger models.
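For reference, the sketch below (not from the paper) shows how accuracy on such single-operation queries might be scored: generate one prompt per operation, query the system under test, and count an answer as correct when it matches the exact value within a tolerance. `query_model`, the prompt templates, and the tolerance are all illustrative assumptions.

```python
# Hypothetical scoring sketch for single arithmetic operations; `query_model`
# is a placeholder for whichever system is being evaluated.
import math
import random

OPERATIONS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
    "/": lambda a, b: a / b,
    "sin": lambda a, _: math.sin(a),
    "cos": lambda a, _: math.cos(a),
    "log": lambda a, _: math.log(a),
    "exp": lambda a, _: math.exp(a),
    "sqrt": lambda a, _: math.sqrt(a),
}

def score_single_operations(query_model, n_trials=100, rel_tol=1e-3):
    """Return the fraction of single-operation queries answered correctly."""
    correct, total = 0, 0
    for op, fn in OPERATIONS.items():
        for _ in range(n_trials):
            a, b = random.uniform(1, 100), random.uniform(1, 100)
            prompt = f"What is {a} {op} {b}?" if op in "+-*/" else f"What is {op}({a})?"
            target = fn(a, b)
            answer = float(query_model(prompt))  # placeholder model call
            correct += math.isclose(answer, target, rel_tol=rel_tol)
            total += 1
    return correct / total
```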
Researcher Affiliation: Academia. Owen Dugan, Department of Physics, Massachusetts Institute of Technology, Cambridge, MA, odugan@mit.edu; Donato M. Jiménez-Benetó, Department of Physics, Massachusetts Institute of Technology, Cambridge, MA, donatojb@mit.edu; Charlotte Loh, Department of EECS, Massachusetts Institute of Technology, Cambridge, MA, cloh@mit.edu; Zhuo Chen, Department of Physics, Massachusetts Institute of Technology, Cambridge, MA, chenzhuo@mit.edu; Rumen Dangovski, Department of EECS, Massachusetts Institute of Technology, Cambridge, MA, rumenrd@mit.edu; Marin Soljačić, Department of Physics, Massachusetts Institute of Technology, Cambridge, MA, soljacic@mit.edu
Pseudocode: No. The paper includes architectural diagrams and mathematical formulations, but no explicitly labeled pseudocode or algorithm blocks.
Open Source Code: Yes. Code is available at https://github.com/druidowm/OccamLLM.
Open Datasets: Yes. We create synthetic datasets to train the OccamLLM decoders... which we modified from examples in the MultiArith training dataset [33]. We evaluate our method and baselines on the following six benchmarks: AddSub [41], GSM8K [42], MultiArith [33], MATH401 [8], SingleEq [43], and SVAMP [44].
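As a side note, the benchmarks named above are publicly available; the snippet below shows one way to pull GSM8K through the Hugging Face `datasets` library. The dataset id and split names are the standard Hub ones, and the paper may instead have used the original releases of each benchmark.

```python
# Illustrative only: fetch GSM8K (one of the six evaluation benchmarks) from the Hub.
from datasets import load_dataset

gsm8k = load_dataset("gsm8k", "main")   # grade-school math word problems, train/test splits
print(len(gsm8k["test"]))               # number of evaluation problems
print(gsm8k["test"][0]["question"])     # an example problem statement
```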
Dataset Splits: Yes. For training the decoder that controls the weights of OccamNet, we created two types of examples, single queries and concatenated queries. For single queries, we select a single prompt from the problems generated as discussed in Section 3.2. ... To create the training dataset, each example is sampled by first randomly selecting whether to create a single or concatenated query... We created a training dataset consisting of 80,000 examples split into 40,000 single queries and 40,000 sequences of concatenated queries.
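The 50/50 composition described above could be reproduced with a sampler along the following lines. This is a hedged sketch: `make_single_query` and `make_concatenated_query` are hypothetical stand-ins for the paper's actual prompt generators built from the MultiArith-style templates.

```python
# Hypothetical sketch of the 80,000-example training-set composition:
# each example is drawn by first choosing single vs. concatenated query.
import random

def make_single_query(rng):
    # Stand-in for one generated arithmetic word problem (not the paper's templates).
    a, b = rng.randint(1, 999), rng.randint(1, 999)
    return f"Tom has {a} apples and buys {b} more. How many apples does he have? Answer: {a + b}"

def make_concatenated_query(rng, max_parts=3):
    # Stand-in for a sequence of problems joined into one example.
    return " ".join(make_single_query(rng) for _ in range(rng.randint(2, max_parts)))

def build_training_set(n_examples=80_000, seed=0):
    rng = random.Random(seed)
    dataset = []
    for _ in range(n_examples):
        if rng.random() < 0.5:
            dataset.append(make_single_query(rng))        # ~40,000 single queries
        else:
            dataset.append(make_concatenated_query(rng))  # ~40,000 concatenated queries
    return dataset
```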
Hardware Specification: Yes. For training and evaluating OccamLlama 8B, we used a single 32 GB NVIDIA Tesla V100 GPU. For OccamLlama 70B, we used two 80 GB NVIDIA A100 GPUs.
Software Dependencies: Yes. For all OccamLLM results, we use Llama 3 8B Instruct and Llama 3 70B Instruct [35] as the underlying language models. ... We benchmark our methods against unmodified Llama 2 7B Chat (Llama 2 7B) [36], unmodified Llama 3 8B Instruct (Llama 3 8B) [35], unmodified Llama 3 70B Instruct (Llama 3 70B) [35], gpt-3.5-turbo-0125 (GPT 3.5 Turbo) [37], gpt-4o-2024-05-13 (GPT 4o) [38], and gpt-4o-2024-05-13 with Code Interpreter (GPT 4o + Code) [39].
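For context, one common way to load the open-weight models named above is through Hugging Face transformers. The model identifiers below are the standard Hub ones; the dtype and device settings are illustrative assumptions rather than the paper's configuration.

```python
# Illustrative only: loading the underlying Llama 3 Instruct model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"  # or "meta-llama/Meta-Llama-3-70B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # half precision to fit on a single large GPU
    device_map="auto",           # spread across available GPUs if needed
)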
Experiment Setup: Yes. For all 1-layer OccamNet training runs, we used a batch size of 1, a learning rate of 6e-4, and a weight decay parameter of 0.01. We use gradient accumulation to achieve an effective batch size of 8. We used a constant learning rate scheduler. We take 1000 samples from OccamNet per token.
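A minimal sketch of these hyperparameters in a standard PyTorch loop follows, assuming AdamW as the optimizer (the paper's quote states only the learning rate, weight decay, and accumulation). The decoder, data, and loss below are tiny hypothetical stand-ins, not the paper's components.

```python
# Hedged sketch: batch size 1, AdamW with lr=6e-4 and weight decay 0.01,
# gradient accumulation to an effective batch size of 8, constant learning rate.
import torch

# Stand-ins for the real decoder, data loader, and loss (hypothetical).
decoder = torch.nn.Linear(16, 16)
train_loader = [torch.randn(1, 16) for _ in range(32)]  # batch size 1
compute_loss = lambda model, x: model(x).pow(2).mean()

optimizer = torch.optim.AdamW(decoder.parameters(), lr=6e-4, weight_decay=0.01)
accum_steps = 8  # gradient accumulation: batch size 1 -> effective batch size 8
# A constant learning-rate schedule means no scheduler.step() is required.

for step, batch in enumerate(train_loader):
    loss = compute_loss(decoder, batch) / accum_steps  # scale for accumulation
    loss.backward()
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```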