Interpreting and Improving Large Language Models in Arithmetic Calculation
Authors: Wei Zhang, Chaoqun Wan, Yonggang Zhang, Yiu-Ming Cheung, Xinmei Tian, Xu Shen, Jieping Ye
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through comprehensive experiments, we find that LLMs frequently involve a small fraction (< 5%) of attention heads, which play a pivotal role in focusing on operands and operators during calculation processes. |
| Researcher Affiliation | Collaboration | (1) University of Science and Technology of China; (2) Alibaba Cloud; (3) Hong Kong Baptist University; (4) Institute of Artificial Intelligence, Hefei Comprehensive National Science Center. |
| Pseudocode | Yes | Algorithm 1 Identifying Key Components |
| Open Source Code | No | The paper mentions using publicly available LLMs (LLaMA2 series from Hugging Face) but does not provide a link or explicit statement about releasing the source code for their own methodology or implementation. |
| Open Datasets | Yes | We evaluate precise SFT on four mathematical datasets (GSM8K (Cobbe et al., 2021), AddSub (Hosseini et al., 2014), SingleEq (Koncel-Kedziorski et al., 2015), SVAMP (Patel et al., 2021)), and another two datasets (MMLU (Hendrycks et al., 2020) and CSQA (Saha et al., 2018)) to evaluate the generic ability. |
| Dataset Splits | No | The paper mentions using and creating datasets for training and evaluation but does not specify exact train/validation/test splits, percentages, or sample counts, nor does it explicitly refer to using standard predefined splits for the public datasets. |
| Hardware Specification | Yes | All experiments are conducted on 8 NVIDIA A100 80GB GPUs. |
| Software Dependencies | No | The paper mentions using LLaMA2-7B and LLaMA2-13B models but does not provide specific version numbers for software dependencies such as programming languages, deep learning frameworks (e.g., PyTorch, TensorFlow), or other relevant libraries. |
| Experiment Setup | Yes | In practice, we train LLaMA2-7B and LLaMA2-13B with a learning rate of 2×10⁻⁵ and a batch size of 128 for 2 epochs. The warm-up ratio and weight decay are set as 0.02 and 0.1 by default, respectively. |
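
The Research Type and Pseudocode rows above refer to the paper's finding that fewer than 5% of attention heads drive arithmetic calculation, identified via Algorithm 1 ("Identifying Key Components"). The paper's exact procedure is not reproduced in this review; the sketch below is only one illustrative way to rank attention heads by how much zero-ablating each one degrades accuracy on a toy arithmetic probe set. The checkpoint name, the probe prompts, and the scoring rule are all assumptions, not the authors' code.

```python
# Illustrative head-ablation sketch (NOT the paper's Algorithm 1): rank LLaMA-style
# attention heads by the accuracy drop caused by zeroing each head's output.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # assumed checkpoint name
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)
model.eval()

# Toy arithmetic probe set (placeholder; the paper uses its own calculation data).
prompts = [("12+35=", "47"), ("9*8=", "72"), ("100-37=", "63")]

def accuracy():
    correct = 0
    for q, a in prompts:
        ids = tok(q, return_tensors="pt").to(model.device)
        out = model.generate(**ids, max_new_tokens=4, do_sample=False)
        pred = tok.decode(out[0, ids.input_ids.shape[1]:], skip_special_tokens=True)
        correct += int(pred.strip().startswith(a))
    return correct / len(prompts)

def ablate_head(layer_idx, head_idx):
    """Zero one head's contribution by masking its slice of the pre-o_proj activations."""
    attn = model.model.layers[layer_idx].self_attn
    head_dim = attn.head_dim

    def pre_hook(module, args):
        hidden = args[0].clone()
        hidden[..., head_idx * head_dim:(head_idx + 1) * head_dim] = 0
        return (hidden,)

    return attn.o_proj.register_forward_pre_hook(pre_hook)

# Exhaustive sweep for illustration only; this is slow on a full 7B model.
baseline = accuracy()
scores = {}
for layer in range(model.config.num_hidden_layers):
    for head in range(model.config.num_attention_heads):
        handle = ablate_head(layer, head)
        scores[(layer, head)] = baseline - accuracy()  # larger drop => more influential head
        handle.remove()

top_heads = sorted(scores, key=scores.get, reverse=True)[:20]
print("Most influential heads (layer, head):", top_heads)
```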
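
The Experiment Setup row reports the fine-tuning hyperparameters (learning rate 2×10⁻⁵, batch size 128, 2 epochs, warm-up ratio 0.02, weight decay 0.1) on 8 A100 GPUs. A minimal sketch of how these could be expressed with Hugging Face `TrainingArguments` follows; the per-device/accumulation split and the dataset wiring are assumptions, not the authors' released configuration.

```python
# Minimal SFT configuration sketch, assuming a Hugging Face Trainer setup.
from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments

model_name = "meta-llama/Llama-2-7b-hf"  # assumed checkpoint name
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

args = TrainingArguments(
    output_dir="sft-arithmetic",
    learning_rate=2e-5,             # reported learning rate
    num_train_epochs=2,             # reported epochs
    per_device_train_batch_size=16, # 8 GPUs x 16 = global batch 128 (assumed split)
    gradient_accumulation_steps=1,
    warmup_ratio=0.02,              # reported warm-up ratio
    weight_decay=0.1,               # reported weight decay
    bf16=True,
    logging_steps=10,
)

# train_ds would hold the tokenized SFT examples (not specified in the paper's public materials):
# trainer = Trainer(model=model, args=args, train_dataset=train_ds, tokenizer=tok)
# trainer.train()
```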