OccamLLM: Fast and Exact Language Model Arithmetic in a Single Step
Authors: Owen Dugan, Donato Jiménez-Benetó, Charlotte Loh, Zhuo Chen, Rumen Dangovski, Marin Soljacic
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our implementation using Llama 3 with OccamNet as a symbolic model (OccamLlama) achieves 100% accuracy on single arithmetic operations (+, −, ×, ÷, sin, cos, log, exp, √), outperforming GPT 4o with and without a code interpreter. Furthermore, OccamLlama outperforms GPT 4o with and without a code interpreter on average across a range of mathematical problem solving benchmarks, demonstrating that OccamLLMs can excel in arithmetic tasks, even surpassing much larger models. |
| Researcher Affiliation | Academia | Owen Dugan, Department of Physics, Massachusetts Institute of Technology, Cambridge, MA, odugan@mit.edu; Donato M. Jiménez-Benetó, Department of Physics, Massachusetts Institute of Technology, Cambridge, MA, donatojb@mit.edu; Charlotte Loh, Department of EECS, Massachusetts Institute of Technology, Cambridge, MA, cloh@mit.edu; Zhuo Chen, Department of Physics, Massachusetts Institute of Technology, Cambridge, MA, chenzhuo@mit.edu; Rumen Dangovski, Department of EECS, Massachusetts Institute of Technology, Cambridge, MA, rumenrd@mit.edu; Marin Soljačić, Department of Physics, Massachusetts Institute of Technology, Cambridge, MA, soljacic@mit.edu |
| Pseudocode | No | The paper includes architectural diagrams and mathematical formulations, but no explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code is available at https://github.com/druidowm/OccamLLM. |
| Open Datasets | Yes | We create synthetic datasets to train the OccamLLM decoders... which we modified from examples in the MultiArith training dataset [33]. We evaluate our method and baselines on the following six benchmarks: AddSub [41], GSM8K [42], MultiArith [33], MATH401 [8], SingleEq [43], and SVAMP [44]. |
| Dataset Splits | Yes | For training the decoder that controls the weights of OccamNet, we created two types of examples: single queries and concatenated queries. For single queries, we select a single prompt from the problems generated as discussed in Section 3.2. ... To create the training dataset, each example is sampled by first randomly selecting whether to create a single or concatenated query... We created a training dataset consisting of 80,000 examples split into 40,000 single queries and 40,000 sequences of concatenated queries. (See the dataset-composition sketch after the table.) |
| Hardware Specification | Yes | For training and evaluating OccamLlama 8B, we used a single 32 GB NVIDIA Tesla V100 GPU. For OccamLlama 70B, we used two 80 GB NVIDIA A100 GPUs. |
| Software Dependencies | Yes | For all OccamLLM results, we use Llama 3 8B Instruct and Llama 3 70B Instruct [35] as the underlying language models. ... We benchmark our methods against unmodified Llama 2 7B Chat (Llama 2 7B) [36], unmodified Llama 3 8B Instruct (Llama 3 8B) [35], unmodified Llama 3 70B Instruct (Llama 3 70B) [35], gpt-3.5-turbo-0125 (GPT 3.5 Turbo) [37], gpt-4o-2024-05-13 (GPT 4o) [38], and gpt-4o-2024-05-13 with Code Interpreter (GPT 4o + Code) [39]. (A model-loading sketch follows the table.) |
| Experiment Setup | Yes | For all 1-layer OccamNet training runs, we used a batch size of 1, a learning rate of 6e-4, and a weight decay parameter of 0.01. We use gradient accumulation to achieve an effective batch size of 8. We used a constant learning rate scheduler. We take 1000 samples from OccamNet per token. (See the training-configuration sketch after the table.) |
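
The Dataset Splits row quotes a training set of 80,000 examples, half single queries and half concatenated queries. The sketch below illustrates that composition only; `make_single_query` and `make_concatenated_query` are hypothetical placeholders for the prompt generators described in Section 3.2 of the paper, not the authors' code.

```python
import random

def make_single_query() -> str:
    """Placeholder: one arithmetic word-problem prompt (illustrative only)."""
    a, b = random.randint(1, 100), random.randint(1, 100)
    return f"What is {a} + {b}?"

def make_concatenated_query() -> str:
    """Placeholder: several single prompts joined into one sequence."""
    return " ".join(make_single_query() for _ in range(random.randint(2, 4)))

def build_training_set(n_single: int = 40_000, n_concat: int = 40_000) -> list[str]:
    # 40,000 single queries + 40,000 concatenated queries = 80,000 examples,
    # matching the split quoted in the Dataset Splits row.
    examples = [make_single_query() for _ in range(n_single)]
    examples += [make_concatenated_query() for _ in range(n_concat)]
    random.shuffle(examples)
    return examples

if __name__ == "__main__":
    data = build_training_set()
    print(len(data))  # 80000
```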
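
The Experiment Setup row lists the decoder-training hyperparameters. This is a minimal PyTorch sketch of those settings (batch size 1, effective batch size 8 via gradient accumulation, learning rate 6e-4, weight decay 0.01, constant learning rate); `decoder`, `train_loader`, and `loss_fn` are generic placeholders, not the authors' training loop, and the choice of AdamW is an assumption.

```python
import torch
from torch.optim import AdamW

def train_decoder(decoder: torch.nn.Module, train_loader, loss_fn, accum_steps: int = 8) -> None:
    # Hyperparameters quoted in the Experiment Setup row.
    optimizer = AdamW(decoder.parameters(), lr=6e-4, weight_decay=0.01)
    optimizer.zero_grad()
    for step, (inputs, targets) in enumerate(train_loader):  # batch size 1 per step
        loss = loss_fn(decoder(inputs), targets) / accum_steps
        loss.backward()
        if (step + 1) % accum_steps == 0:
            optimizer.step()       # one update per 8 micro-batches (effective batch size 8)
            optimizer.zero_grad()
    # No scheduler is created, so the learning rate stays constant throughout training.
```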
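
The Software Dependencies row names Llama 3 8B Instruct as the smaller underlying language model. The snippet below is one plausible way to obtain that checkpoint with Hugging Face `transformers`; the model id is the public release and is an assumption about how the weights were loaded, not a statement about the authors' exact setup.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Public checkpoint for Llama 3 8B Instruct (assumed source of the base weights).
model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")
```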