Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Interpreting Arithmetic Reasoning in Large Language Models using Game-Theoretic Interactions
Authors: 蕾蕾 温, Liwei Zheng, Hongda Li, Lijun Sun, Zhihua Wei, Wen Shen
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we conduct comparative studies to analyze the internal mechanism of LLMs for arithmetic reasoning (see Section 4.1). We also fine-tune an LLM to improve its capability to solve arithmetic problems and explore how the LLM encodes different types of interactions during the training process (see Section 4.2). Table 1 shows the overall accuracy (%) of different LLMs on arithmetic queries. |
| Researcher Affiliation | Academia | Tongji University, Shanghai, China |
| Pseudocode | No | The paper describes methods using mathematical equations and definitions (e.g., Definition 1, Definition 2, Equation 1, Equation 2), but no section or figure is explicitly labeled 'Pseudocode' or 'Algorithm', nor are there structured algorithmic steps presented in a code-like format. |
| Open Source Code | No | While the code and dataset generated are not yet ready for anonymous open sourcing, we plan to open-source the code and the data with appropriate licensing. |
| Open Datasets | Yes | We follow Karpas et al. [2022], Razeghi et al. [2022], Stolfo et al. [2023] to conduct experiments on a set of arithmetic problems hand-crafted by humans, including 6 templates for one-operator two-operand queries and 29 templates for two-operator three-operand queries. Please see Appendix F for details of templates. |
| Dataset Splits | Yes | For single-operator data, we use an 8/2 train-test split, while for two-operator data, we use a 9/1 train-test split. |
| Hardware Specification | Yes | We conducted our experiments on an NVIDIA GeForce RTX 4090 24GB GPU. |
| Software Dependencies | No | The paper mentions several LLM models (e.g., OPT-1.3B, Llama-2-7B) and the LoRA method, but does not provide specific version numbers for ancillary software dependencies like Python, PyTorch, TensorFlow, or specific libraries used for implementation beyond the models themselves. |
| Experiment Setup | Yes | For the one-operator templates, we train the model for 10 epochs with a batch size of 16. For the two-operator templates, we train the model for 20 epochs with a batch size of 32. The training uses a learning rate of 8e-4 with a linear decay scheduler. The LoRA configuration includes a rank of 8, a LoRA alpha of 32, and a dropout of 0.05. |
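
The dataset splits and experiment setup reported in the table can be collected into a small configuration sketch. This is not the authors' code (the table notes it has not been released). The class, field, and function names are hypothetical, and the shuffling seed and the absence of warmup in the schedule are assumptions. Only the numeric values come from the quoted responses.

```python
from dataclasses import dataclass
import random


@dataclass(frozen=True)
class FineTuneConfig:
    """Hyperparameters quoted in the table; the field names are hypothetical."""
    epochs: int
    batch_size: int
    train_fraction: float
    learning_rate: float = 8e-4
    lora_rank: int = 8
    lora_alpha: int = 32
    lora_dropout: float = 0.05


# One-operator templates: 10 epochs, batch size 16, 8/2 train-test split.
ONE_OPERATOR = FineTuneConfig(epochs=10, batch_size=16, train_fraction=0.8)
# Two-operator templates: 20 epochs, batch size 32, 9/1 train-test split.
TWO_OPERATOR = FineTuneConfig(epochs=20, batch_size=32, train_fraction=0.9)


def split_train_test(examples, train_fraction, seed=0):
    """Shuffle and split the examples.

    The shuffle and the seed are assumptions; the source states only the ratios.
    """
    items = list(examples)
    random.Random(seed).shuffle(items)
    cut = int(round(len(items) * train_fraction))
    return items[:cut], items[cut:]


def linear_decay_lr(step, total_steps, base_lr):
    """Decay linearly from base_lr at step 0 to 0 at total_steps.

    No warmup phase is assumed; the source says only "linear decay scheduler".
    """
    remaining = max(0, total_steps - step)
    return base_lr * remaining / total_steps


if __name__ == "__main__":
    train, test = split_train_test(range(100), ONE_OPERATOR.train_fraction)
    print(len(train), len(test))
    print(linear_decay_lr(50, 100, ONE_OPERATOR.learning_rate))
```

If the fine-tuning were done with the Hugging Face PEFT library (an assumption; the table lists no software dependencies), the three LoRA fields would map onto the `r`, `lora_alpha`, and `lora_dropout` arguments of `LoraConfig`.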