ToRA: A Tool-Integrated Reasoning Agent for Mathematical Problem Solving
Authors: Zhibin Gou, Zhihong Shao, Yeyun Gong, yelong shen, Yujiu Yang, Minlie Huang, Nan Duan, Weizhu Chen
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate the resulting suite of Tool-integrated Reasoning Agents (ToRA) ranging from 7B to 70B on 10 diverse mathematical reasoning datasets. As shown in Fig. 1, ToRA series significantly outperform open-source models across all scales. Notably, on the competition-level MATH dataset, ToRA-7B outperforms the previous SoTA WizardMath-70B (Luo et al., 2023) by 22% absolute. ToRA-Code-34B beats GPT-4's CoT result (Bubeck et al., 2023) by 8.3% absolute (50.8% vs. 42.5%), and is competitive with GPT-4 solving problems with code (GPT-4-Code, 51.8%). |
| Researcher Affiliation | Collaboration | ¹Tsinghua University, ²Microsoft Research, ³Microsoft Azure AI |
| Pseudocode | Yes | Algorithm 1: Inference of Tool-Integrated Reasoning (a sketch of this inference loop follows the table) |
| Open Source Code | Yes | Code and models are available at https://github.com/microsoft/ToRA. |
| Open Datasets | Yes | We evaluated models on GSM8k (Cobbe et al., 2021) and MATH (Hendrycks et al., 2021), along with 8 out-of-distribution datasets, namely GSM-Hard (Gao et al., 2022), SVAMP (Patel et al., 2021), ASDiv (Miao et al., 2020), TabMWP (Lu et al., 2023), SingleEq, SingleOp, AddSub, and MultiArith (Koncel-Kedziorski et al., 2016), as illustrated in Table 5 in Appendix. |
| Dataset Splits | No | The paper states that ToRA-CORPUS only uses questions from the original training sets of MATH and GSM8k and that evaluation is run on the datasets above, but it does not specify explicit train/validation/test splits with percentages or sample counts for reproduction. |
| Hardware Specification | No | The paper does not provide specific hardware details (exact GPU/CPU models, processor types with speeds, memory amounts, or detailed computer specifications) used for running its experiments. |
| Software Dependencies | No | The paper mentions software such as DeepSpeed ZeRO Stage 3 and FlashAttention-2, and uses 'sympy', but does not provide specific version numbers for these or other software dependencies. |
| Experiment Setup | Yes | We used a learning rate of 2e-5 by default except that we used 1e-5 for the 34B and 70B models. We set the global batch size to 128 and used a linear scheduler with a 3% warm-up period for 3 epochs. We used greedy decoding for all results, with the maximum sequence length set to 2,048 and the maximum number of tool executions set to 3. (These hyperparameters are collected in the configuration sketch after the table.) |
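
The Algorithm 1 pseudocode itself is not reproduced above, so the following is a minimal sketch of the tool-integrated inference loop it describes: generate a rationale until a program block appears, execute the program, append its output, and continue. The `generate` and `run_python` helpers are hypothetical placeholders for a greedy decoder and a sandboxed Python interpreter, and the block markers and stop string are assumptions about the trajectory format; only the control flow and the cap of three tool executions come from the paper.

```python
# Hedged sketch of Algorithm 1 (tool-integrated reasoning inference).
# `generate` and `run_python` are hypothetical placeholders; only the control
# flow follows the paper's description. The markers mirror ToRA's
# markdown-style program/output delimiters and are built with string ops so
# this example stays self-contained.

CODE_START = "`" * 3 + "python"   # opens a program block in the trajectory
BLOCK_END = "`" * 3               # closes a program or output block
OUTPUT_START = BLOCK_END + "output"

MAX_TOOL_EXECUTIONS = 3           # cap reported in the Experiment Setup row


def generate(prefix: str, stop: list[str]) -> str:
    """Greedy-decode a continuation of `prefix` until a stop string (model call omitted)."""
    raise NotImplementedError


def run_python(program: str) -> str:
    """Execute `program` in a sandbox and return its output (interpreter omitted)."""
    raise NotImplementedError


def tool_integrated_inference(question: str) -> str:
    trajectory = question
    for _ in range(MAX_TOOL_EXECUTIONS):
        # The model interleaves natural-language rationale with a program block.
        step = generate(trajectory, stop=[OUTPUT_START])
        trajectory += step
        if CODE_START not in step:
            break  # no tool call: the step already contains the final answer
        # Extract the latest program, execute it, and append its output so the
        # next decoding step can condition on the tool's result.
        program = step.split(CODE_START)[-1].split(BLOCK_END)[0]
        output = run_python(program)
        trajectory += f"\n{OUTPUT_START}\n{output}\n{BLOCK_END}\n"
    return trajectory
```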
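
For convenience, the fine-tuning hyperparameters from the Experiment Setup row can be gathered into one configuration. The paper does not state which training framework it uses, so expressing them via Hugging Face `TrainingArguments` is only an illustrative mapping, and the per-device batch size, accumulation steps, and precision flag are assumptions (hardware and precision are not reported).

```python
# Hedged mapping of the reported hyperparameters onto Hugging Face
# TrainingArguments. Only the values named in the Experiment Setup row
# (learning rate, linear scheduler, 3% warm-up, 3 epochs, global batch 128)
# come from the paper; everything else is an assumption.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="tora-sft",              # hypothetical output path
    learning_rate=2e-5,                 # 1e-5 for the 34B and 70B models
    lr_scheduler_type="linear",
    warmup_ratio=0.03,                  # 3% warm-up period
    num_train_epochs=3,
    per_device_train_batch_size=4,      # assumption: hardware is not reported
    gradient_accumulation_steps=4,      # assumption: 4 * 4 * 8 GPUs = global batch 128
    bf16=True,                          # assumption: precision is not reported
)
```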