ToRA: A Tool-Integrated Reasoning Agent for Mathematical Problem Solving

Authors: Zhibin Gou, Zhihong Shao, Yeyun Gong, Yelong Shen, Yujiu Yang, Minlie Huang, Nan Duan, Weizhu Chen

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate the resulting suite of Tool-integrated Reasoning Agents (ToRA) ranging from 7B to 70B on 10 diverse mathematical reasoning datasets. As shown in Figure 1, the ToRA series significantly outperforms open-source models across all scales. Notably, on the competition-level MATH dataset, ToRA-7B outperforms the previous SoTA WizardMath-70B (Luo et al., 2023) by 22% absolute. ToRA-Code-34B beats GPT-4's CoT result (Bubeck et al., 2023) by 8.3% absolute (50.8% vs. 42.5%), and is competitive with GPT-4 solving problems with code (GPT-4-Code, 51.8%).
Researcher Affiliation | Collaboration | Tsinghua University; Microsoft Research; Microsoft Azure AI
Pseudocode | Yes | Algorithm 1: Inference of Tool-Integrated Reasoning (a hedged sketch of this inference loop appears after the table).
Open Source Code | Yes | Code and models are available at https://github.com/microsoft/ToRA.
Open Datasets | Yes | We evaluated models on GSM8k (Cobbe et al., 2021) and MATH (Hendrycks et al., 2021), along with 8 out-of-distribution datasets, namely GSM-Hard (Gao et al., 2022), SVAMP (Patel et al., 2021), ASDiv (Miao et al., 2020), TabMWP (Lu et al., 2023), SingleEQ, SingleOP, AddSub, and MultiArith (Koncel-Kedziorski et al., 2016), as illustrated in Table 5 in the Appendix.
Dataset Splits | No | The paper states that TORA-CORPUS only uses questions from the original training sets of MATH and GSM8k and evaluates on the listed datasets, but it does not specify explicit train/validation/test splits with percentages or sample counts for reproduction.
Hardware Specification | No | The paper does not provide specific hardware details (exact GPU/CPU models, processor speeds, memory amounts, or other machine specifications) used to run its experiments.
Software Dependencies | No | The paper mentions software such as DeepSpeed ZeRO Stage 3, FlashAttention-2, and the sympy library, but does not provide version numbers for these or other software dependencies.
Experiment Setup | Yes | We used a learning rate of 2e-5 by default, except that we used 1e-5 for the 34B and 70B models. We set the global batch size to 128 and used a linear scheduler with a 3% warm-up period for 3 epochs. We used greedy decoding for all results, with the maximum sequence length set to 2,048 and the maximum number of tool executions set to 3. (A hedged configuration sketch based on these values appears after the table.)
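
The paper's Algorithm 1 interleaves natural-language rationales with program generation and execution, feeding interpreter output back to the model until it stops or the tool budget is exhausted. The sketch below is a minimal, hypothetical rendering of that loop: `generate` (the LLM call) and `execute_python` (a sandboxed interpreter) are assumed placeholder functions, not functions from the ToRA repository, and the markdown-style code/output markers are illustrative only.

```python
# Minimal sketch of tool-integrated reasoning inference (cf. Algorithm 1).
# `generate` and `execute_python` are hypothetical placeholders.

MAX_TOOL_EXECUTIONS = 3  # paper: at most 3 tool executions per problem


def tool_integrated_inference(problem: str, generate, execute_python) -> str:
    trajectory = problem
    for _ in range(MAX_TOOL_EXECUTIONS):
        # The model writes a rationale followed by a Python program and stops
        # when it emits the (assumed) end-of-program marker "```output".
        continuation = generate(trajectory, stop=["```output"])
        trajectory += continuation
        if "```python" not in continuation:
            # No program was produced: the model finished with a final answer.
            return trajectory
        # Extract the program between the ```python fence and the closing fence.
        program = continuation.split("```python")[-1].split("```")[0]
        result = execute_python(program)              # run the program
        trajectory += f"```output\n{result}\n```\n"   # feed output back to the model
    # Tool budget exhausted: ask the model for a final natural-language answer.
    trajectory += generate(trajectory, stop=["</s>"])
    return trajectory
```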
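The experiment-setup row lists the key training hyperparameters; the snippet below is a minimal sketch of how they might be expressed with Hugging Face's `TrainingArguments`. The paper does not state that it used this API, and the output directory, precision, DeepSpeed config path, and the per-device / gradient-accumulation split (chosen here only so that 8 GPUs x 4 x 4 reaches the reported global batch size of 128) are assumptions.

```python
from transformers import TrainingArguments

# Hypothetical mapping of the reported hyperparameters onto TrainingArguments;
# the ToRA paper does not specify this exact tooling.
training_args = TrainingArguments(
    output_dir="tora-sft",              # assumed output path
    learning_rate=2e-5,                 # paper: 1e-5 for the 34B and 70B models
    num_train_epochs=3,
    lr_scheduler_type="linear",
    warmup_ratio=0.03,                  # 3% warm-up period
    per_device_train_batch_size=4,      # assumed; not stated in the paper
    gradient_accumulation_steps=4,      # assumed; 8 GPUs x 4 x 4 = 128 global batch
    bf16=True,                          # assumed precision
    deepspeed="ds_zero3_config.json",   # paper mentions DeepSpeed ZeRO Stage 3
)
```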