Interpreting and Improving Large Language Models in Arithmetic Calculation

Authors: Wei Zhang, Chaoqun Wan, Yonggang Zhang, Yiu-Ming Cheung, Xinmei Tian, Xu Shen, Jieping Ye

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Through comprehensive experiments, we find that LLMs frequently involve a small fraction (< 5%) of attention heads, which play a pivotal role in focusing on operands and operators during calculation processes.
Researcher Affiliation | Collaboration | University of Science and Technology of China; Alibaba Cloud; Hong Kong Baptist University; Institute of Artificial Intelligence, Hefei Comprehensive National Science Center.
Pseudocode | Yes | Algorithm 1: Identifying Key Components
Open Source Code | No | The paper mentions using publicly available LLMs (LLaMA2 series from Hugging Face) but does not provide a link or explicit statement about releasing the source code for their own methodology or implementation.
Open Datasets | Yes | We evaluate precise SFT on four mathematical datasets (GSM8K (Cobbe et al., 2021), AddSub (Hosseini et al., 2014), SingleEq (Koncel-Kedziorski et al., 2015), SVAMP (Patel et al., 2021)), and another two datasets (MMLU (Hendrycks et al., 2020) and CSQA (Saha et al., 2018)) to evaluate the generic ability. (A loading sketch for GSM8K follows the table.)
Dataset Splits | No | The paper mentions using and creating datasets for training and evaluation but does not specify exact train/validation/test splits, percentages, or sample counts, nor does it explicitly refer to using standard predefined splits for the public datasets.
Hardware Specification | Yes | All experiments are conducted on 8 NVIDIA A100 80GB GPUs.
Software Dependencies | No | The paper mentions using LLaMA2-7B and LLaMA2-13B models but does not provide specific version numbers for software dependencies such as programming languages, deep learning frameworks (e.g., PyTorch, TensorFlow), or other relevant libraries.
Experiment Setup | Yes | In practice, we train LLaMA2-7B and LLaMA2-13B with a learning rate of 2 × 10⁻⁵ and a batch size of 128 for 2 epochs. The warm-up ratio and weight decay are set as 0.02 and 0.1 by default, respectively.
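
The hyperparameters quoted in the Experiment Setup row map directly onto a standard supervised fine-tuning configuration. Below is a minimal sketch using Hugging Face Transformers' TrainingArguments; the framework choice is an assumption (the paper does not name its training stack), and the per-device batch size, output path, and bf16 flag are illustrative values consistent with a global batch size of 128 spread over 8 A100 GPUs.

    from transformers import TrainingArguments

    # Minimal sketch of the reported SFT setup; the framework (Hugging Face
    # Transformers) and the per-device/bf16/output-path values are assumptions.
    training_args = TrainingArguments(
        output_dir="llama2-arithmetic-sft",  # hypothetical output path
        learning_rate=2e-5,                  # reported learning rate (2 × 10⁻⁵)
        per_device_train_batch_size=16,      # 16 per GPU × 8 A100s = global batch 128
        num_train_epochs=2,                  # reported number of epochs
        warmup_ratio=0.02,                   # reported warm-up ratio
        weight_decay=0.1,                    # reported weight decay
        bf16=True,                           # assumption: mixed precision on A100 80GB
    )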
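
For the Open Datasets row, here is a hedged sketch of loading one of the public evaluation sets. The paper does not describe how its data was obtained; the Hugging Face Hub identifier "gsm8k" with the "main" configuration is an assumption used for illustration, and the remaining benchmarks (AddSub, SingleEq, SVAMP, MMLU, CSQA) are hosted under varying identifiers not shown here.

    from datasets import load_dataset

    # Hypothetical loading of GSM8K; the hub identifier and config name are
    # assumptions, as the paper does not state how its evaluation data was sourced.
    gsm8k = load_dataset("gsm8k", "main")

    print(gsm8k)                          # DatasetDict with "train" and "test" splits
    print(gsm8k["test"][0]["question"])   # one grade-school math word problem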