Metacognitive Capabilities of LLMs: An Exploration in Mathematical Problem Solving
Authors: Aniket Didolkar, Anirudh Goyal, Nan Rosemary Ke, Siyuan Guo, Michal Valko, Timothy Lillicrap, Danilo Jimenez Rezende, Yoshua Bengio, Michael C. Mozer, Sanjeev Arora
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our evaluation focused on three distinct areas. (1) Text-based prompts: we utilized chain-of-thought prompting, as detailed in Section 4.1; this method provides step-by-step reasoning in the prompt to guide the model's thought process. (2) Program-based prompts: we employed program-aided language models (PALs), described in Section 4.2; PALs integrate programming logic within the language model to enhance its reasoning capabilities. (3) Transferability: we investigate the generalizability of these skills across different LLMs and datasets, as elaborated in Section 4.3; this tests how well the skills transfer to other LLMs and unseen datasets. Our results demonstrate that knowledge of skills significantly improves performance for both text-based and program-based prompting across different datasets. (A minimal sketch contrasting the two prompting styles appears after the table.) |
| Researcher Affiliation | Collaboration | Aniket Didolkar 1, Anirudh Goyal 1, Nan Rosemary Ke 4, Siyuan Guo 3,5, Michal Valko 4, Timothy Lillicrap 4, Danilo Rezende 4, Yoshua Bengio 1, Michael Mozer 4, Sanjeev Arora 2. Affiliations: 1 Mila, University of Montreal; 2 Princeton University; 3 The University of Cambridge; 4 Google DeepMind; 5 Max Planck Institute for Intelligent Systems |
| Pseudocode | No | The paper describes its methodology through figures and text, but it does not contain any formally structured pseudocode blocks or algorithms labeled as such. |
| Open Source Code | No | Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [No] Justification: The code will be made publicly available later. |
| Open Datasets | Yes | We start with the GSM8K dataset [33], which comprises grade-school level math problems. We then move on to the challenging MATH dataset [16], known for its competition-level problems. To examine the transferability of skills, we apply the skills from the GSM8K dataset to other math word problem datasets. These include SVAMP [15], ASDiv [38], and the MAWPS suite (SingleOp, SingleEq, AddSub, MultiArith) [39]. (A dataset-loading sketch appears after the table.) |
| Dataset Splits | No | The paper specifies training and test set sizes (e.g., 'GSM8K dataset [33] contains 7.5k training problems and 1k test problems.' and 'Its training set has 7.5k examples and the test set has 5k examples'), but it does not explicitly mention a separate validation dataset split. |
| Hardware Specification | Yes | We use 1 A100L GPU for this experiment. |
| Software Dependencies | No | The paper mentions using specific LLMs like GPT-4-0613, GPT-3.5-Turbo, and Mixtral-8x7B, and indicates Python for PALs. However, it does not provide specific version numbers for any ancillary software libraries or frameworks like PyTorch, TensorFlow, or scikit-learn. |
| Experiment Setup | Yes | All experiments were carried out using GPT-4-0613, employing 8-shot prompting and a decoding temperature of 1.0. The results are displayed in Table 5. We use 1 A100L GPU for this experiment. For each problem, 4 in-context examples are chosen based on skill-matching, and outputs are sampled with a decoding temperature of 0.2. We employ a CoT-based method with 4-shot prompting and greedy decoding, aligning with the baseline settings. (A decoding-settings sketch appears after the table.) |
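
The two prompting styles in the Research Type cell can be made concrete. Below is a minimal sketch, not the paper's actual prompts: the demonstration problem is the canonical GSM8K example, and the template wording is a hypothetical stand-in.

```python
# Minimal sketch contrasting text-based (chain-of-thought) and program-based
# (PAL) prompting. The exact prompt formats used in the paper are not
# reproduced here; these templates are illustrative only.

# Chain-of-thought: the demonstration shows step-by-step natural-language reasoning.
COT_TEMPLATE = """Q: Natalia sold clips to 48 of her friends in April, and then \
she sold half as many clips in May. How many clips did Natalia sell altogether?
A: In April she sold 48 clips. In May she sold 48 / 2 = 24 clips.
Altogether she sold 48 + 24 = 72 clips. The answer is 72.

Q: {question}
A:"""

# PAL: the demonstration is a Python program whose return value is the answer,
# so the final arithmetic is delegated to an interpreter rather than the LLM.
PAL_TEMPLATE = '''Q: Natalia sold clips to 48 of her friends in April, and then \
she sold half as many clips in May. How many clips did Natalia sell altogether?

# solution in Python:
def solution():
    clips_april = 48
    clips_may = clips_april / 2
    return clips_april + clips_may

Q: {question}

# solution in Python:
'''

if __name__ == "__main__":
    q = "A farmer plants 3 fields with 12 rows of 8 plants each. How many plants in total?"
    print(COT_TEMPLATE.format(question=q))
    print(PAL_TEMPLATE.format(question=q))
```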
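
The datasets in the Open Datasets cell are publicly available. A minimal loading sketch for GSM8K, assuming the Hugging Face `datasets` library and the public Hub identifier "gsm8k"; the paper does not state how the authors obtained the data, and Hub identifiers for the other benchmarks vary by mirror:

```python
# Minimal sketch: loading GSM8K via the Hugging Face `datasets` library.
# The identifier and field names below match the public "gsm8k" dataset;
# this is an illustration, not the paper's own data pipeline.
from datasets import load_dataset

gsm8k = load_dataset("gsm8k", "main")  # train/test splits as reported above
print(gsm8k["train"][0]["question"])   # grade-school word problem (text)
print(gsm8k["train"][0]["answer"])     # step-by-step solution ending in "#### <n>"
```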
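
The Experiment Setup cell mixes two decoding regimes: temperature 1.0 with 8-shot prompting in one stage, and temperature 0.2 (or greedy decoding) with 4 skill-matched in-context examples in another. A minimal sketch of issuing such a query with the OpenAI Python client (v1+); only the model name and temperature values come from the paper, while the `solve` helper and its prompt assembly are hypothetical:

```python
# Minimal sketch of the reported decoding settings using the OpenAI Python
# client (>= 1.0). Only "gpt-4-0613" and the temperatures are taken from the
# paper; the helper below is an illustrative assumption.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def solve(question: str, examples: list[str], temperature: float) -> str:
    """Few-shot query: `examples` are in-context demonstrations, e.g. the
    4 skill-matched exemplars or a fixed set of 8 CoT exemplars."""
    prompt = "\n\n".join(examples + [f"Q: {question}\nA:"])
    resp = client.chat.completions.create(
        model="gpt-4-0613",
        messages=[{"role": "user", "content": prompt}],
        temperature=temperature,  # 1.0, 0.2, or 0.0 (~greedy), per the table
    )
    return resp.choices[0].message.content
```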