Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Interpreting Arithmetic Reasoning in Large Language Models using Game-Theoretic Interactions

Authors: Leilei Wen, Liwei Zheng, Hongda Li, Lijun Sun, Zhihua Wei, Wen Shen

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	In this section, we conduct comparative studies to analyze the internal mechanism of LLMs for arithmetic reasoning (see Section 4.1). We also fine-tune an LLM to improve its capability to solve arithmetic problems and explore how the LLM encodes different types of interactions during the training process (see Section 4.2). Table 1 shows the overall accuracy (%) of different LLMs on arithmetic queries.
Researcher Affiliation	Academia	Tongji University, Shanghai, China
Pseudocode	No	The paper describes methods using mathematical equations and definitions (e.g., Definition 1, Definition 2, Equation 1, Equation 2), but no section or figure is explicitly labeled 'Pseudocode' or 'Algorithm', nor are there structured algorithmic steps presented in a code-like format.
Open Source Code	No	While the code and dataset generated are not yet ready for anonymous open sourcing, we plan to open-source the code and the data with appropriate licensing.
Open Datasets	Yes	We follow Karpas et al. [2022], Razeghi et al. [2022], Stolfo et al. [2023] to conduct experiments on a set of arithmetic problems hand-crafted by humans, including 6 templates for one-operator two-operand queries and 29 templates for two-operator three-operand queries. Please see Appendix F for details of templates.
Dataset Splits	Yes	For single-operator data, we use an 8/2 train-test split, while for two-operator data, we use a 9/1 train-test split.
Hardware Specification	Yes	We conducted our experiments on an NVIDIA GeForce RTX 4090 24GB GPU.
Software Dependencies	No	The paper mentions several LLM models (e.g., OPT-1.3B, Llama-2-7B) and the LoRA method, but does not provide specific version numbers for ancillary software dependencies like Python, PyTorch, TensorFlow, or specific libraries used for implementation beyond the models themselves.
Experiment Setup	Yes	For the one-operator templates, we train the model for 10 epochs with a batch size of 16. For the two-operator templates, we train the model for 20 epochs with a batch size of 32. The training uses a learning rate of 8e-4 with a linear decay scheduler. The LoRA configuration includes a rank of 8, a LoRA alpha of 32, and a dropout of 0.05.
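
The reported splits and hyperparameters in the table above can be collected into a small configuration sketch. This is a minimal illustration, not the authors' code, which has not been released. The names `FineTuneConfig`, `linear_decay_lr`, and `train_test_split` are hypothetical. The sketch also assumes details the quoted text does not state: that the linear decay runs to zero with no warmup, and that examples are shuffled with a fixed seed before splitting.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class FineTuneConfig:
    """Hyperparameters quoted in the table; the field names are hypothetical."""
    epochs: int
    batch_size: int
    train_fraction: float
    learning_rate: float = 8e-4
    lora_rank: int = 8
    lora_alpha: int = 32
    lora_dropout: float = 0.05


# One-operator templates: 10 epochs, batch size 16, 8/2 train-test split.
ONE_OPERATOR = FineTuneConfig(epochs=10, batch_size=16, train_fraction=0.8)
# Two-operator templates: 20 epochs, batch size 32, 9/1 train-test split.
TWO_OPERATOR = FineTuneConfig(epochs=20, batch_size=32, train_fraction=0.9)


def linear_decay_lr(step: int, total_steps: int, base_lr: float) -> float:
    """Learning rate at `step`.

    Assumption: the rate decays linearly to zero with no warmup; the quoted
    text only says "linear decay scheduler".
    """
    remaining = max(0, total_steps - step)
    return base_lr * remaining / total_steps


def train_test_split(items, train_fraction: float, seed: int = 0):
    """Shuffle and split. Assumption: shuffling and the seed are not reported."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


if __name__ == "__main__":
    train, test = train_test_split(range(100), ONE_OPERATOR.train_fraction)
    print(len(train), len(test))
    print(linear_decay_lr(50, 100, ONE_OPERATOR.learning_rate))
```

Anyone attempting a reproduction would still need to choose library versions, a random seed, and the exact scheduler endpoints, since the entry marks software dependencies as unreported.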