Interpreting and Improving Large Language Models in Arithmetic Calculation

Authors: Wei Zhang, Chaoqun Wan, Yonggang Zhang, Yiu-Ming Cheung, Xinmei Tian, Xu Shen, Jieping Ye

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Through comprehensive experiments, we find that LLMs frequently involve a small fraction (< 5%) of attention heads, which play a pivotal role in focusing on operands and operators during calculation processes.
Researcher Affiliation | Collaboration | University of Science and Technology of China; Alibaba Cloud; Hong Kong Baptist University; Institute of Artificial Intelligence, Hefei Comprehensive National Science Center.
Pseudocode | Yes | Algorithm 1: Identifying Key Components
Open Source Code | No | The paper mentions using publicly available LLMs (LLaMA2 series from Hugging Face) but does not provide a link or explicit statement about releasing the source code for their own methodology or implementation.
Open Datasets | Yes | We evaluate precise SFT on four mathematical datasets (GSM8K (Cobbe et al., 2021), AddSub (Hosseini et al., 2014), SingleEq (Koncel-Kedziorski et al., 2015), SVAMP (Patel et al., 2021)), and another two datasets (MMLU (Hendrycks et al., 2020) and CSQA (Saha et al., 2018)) to evaluate the generic ability. (A loading sketch for GSM8K follows the table.)
Dataset Splits | No | The paper mentions using and creating datasets for training and evaluation but does not specify exact train/validation/test splits, percentages, or sample counts, nor does it explicitly refer to using standard predefined splits for the public datasets.
Hardware Specification | Yes | All experiments are conducted on 8 NVIDIA A100 80GB GPUs.
Software Dependencies | No | The paper mentions using LLaMA2-7B and LLaMA2-13B models but does not provide specific version numbers for software dependencies such as programming languages, deep learning frameworks (e.g., PyTorch, TensorFlow), or other relevant libraries.
Experiment Setup | Yes | In practice, we train LLaMA2-7B and LLaMA2-13B with a learning rate of 2 × 10⁻⁵ and a batch size of 128 for 2 epochs. The warm-up ratio and weight decay are set as 0.02 and 0.1 by default, respectively.
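
The hyperparameters quoted in the Experiment Setup row map directly onto a standard supervised fine-tuning configuration. Below is a minimal sketch using Hugging Face Transformers' TrainingArguments; the framework choice is an assumption (the paper does not name its training stack), and the per-device batch size, output path, and bf16 flag are illustrative values consistent with a global batch size of 128 spread over 8 A100 GPUs.

    from transformers import TrainingArguments

    # Minimal sketch of the reported SFT setup; the framework (Hugging Face
    # Transformers) and the per-device/bf16/output-path values are assumptions.
    training_args = TrainingArguments(
        output_dir="llama2-arithmetic-sft",  # hypothetical output path
        learning_rate=2e-5,                  # reported learning rate (2 × 10⁻⁵)
        per_device_train_batch_size=16,      # 16 per GPU × 8 A100s = global batch 128
        num_train_epochs=2,                  # reported number of epochs
        warmup_ratio=0.02,                   # reported warm-up ratio
        weight_decay=0.1,                    # reported weight decay
        bf16=True,                           # assumption: mixed precision on A100 80GB
    )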
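
For the Open Datasets row, here is a hedged sketch of loading one of the public evaluation sets. The paper does not describe how its data was obtained; the Hugging Face Hub identifier "gsm8k" with the "main" configuration is an assumption used for illustration, and the remaining benchmarks (AddSub, SingleEq, SVAMP, MMLU, CSQA) are hosted under varying identifiers not shown here.

    from datasets import load_dataset

    # Hypothetical loading of GSM8K; the hub identifier and config name are
    # assumptions, as the paper does not state how its evaluation data was sourced.
    gsm8k = load_dataset("gsm8k", "main")

    print(gsm8k)                          # DatasetDict with "train" and "test" splits
    print(gsm8k["test"][0]["question"])   # one grade-school math word problem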