Metacognitive Capabilities of LLMs: An Exploration in Mathematical Problem Solving
Authors: Aniket Didolkar, Anirudh Goyal, Nan Rosemary Ke, Siyuan Guo, Michal Valko, Timothy Lillicrap, Danilo Jimenez Rezende, Yoshua Bengio, Michael C. Mozer, Sanjeev Arora
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our evaluation focused on three distinct areas. (1) Text-based prompts: we utilized chain-of-thought prompting, as detailed in Section 4.1; this method provides step-by-step reasoning in the prompt to guide the model's thought process. (2) Program-based prompts: we employed program-aided language models (PALs), described in Section 4.2; PALs integrate programming logic within the language model to enhance its reasoning capabilities. (3) Transferability: we investigate the generalizability of these skills across different LLMs and datasets, as elaborated in Section 4.3; this tests how well the skills transfer to other LLMs and unseen datasets. Our results demonstrate that knowledge of skills significantly improves performance for both text-based and program-based prompting across different datasets. (A minimal sketch contrasting the two prompting styles appears after the table.) |
| Researcher Affiliation | Collaboration | Aniket Didolkar 1, Anirudh Goyal 1, Nan Rosemary Ke 4, Siyuan Guo 3,5, Michal Valko 4, Timothy Lillicrap 4, Danilo Rezende 4, Yoshua Bengio 1, Michael Mozer 4, Sanjeev Arora 2. Affiliations: 1 Mila, University of Montreal; 2 Princeton University; 3 The University of Cambridge; 4 Google DeepMind; 5 Max Planck Institute for Intelligent Systems |
| Pseudocode | No | The paper describes its methodology through figures and text, but it does not contain any formally structured pseudocode blocks or algorithms labeled as such. |
| Open Source Code | No | Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [No] Justification: The code will be made publicly available later. |
| Open Datasets | Yes | We start with the GSM8K dataset [33], which comprises grade-school level math problems. We then move on to the challenging MATH dataset [16], known for its competition-level problems. To examine the transferability of skills, we apply the skills from the GSM8K dataset to other math word problem datasets. These include SVAMP [15], ASDiv [38], and the MAWPS suite (SingleOp, SingleEq, AddSub, MultiArith) [39]. (A dataset-loading sketch appears after the table.) |
| Dataset Splits | No | The paper specifies training and test set sizes (e.g., 'GSM8K dataset [33] contains 7.5k training problems and 1k test problems.' and 'Its training set has 7.5k examples and the test set has 5k examples'), but it does not explicitly mention a separate validation dataset split. |
| Hardware Specification | Yes | We use 1 A100L GPU for this experiment. |
| Software Dependencies | No | The paper mentions using specific LLMs like GPT-4-0613, GPT-3.5-Turbo, and Mixtral-8x7B, and indicates Python for PALs. However, it does not provide specific version numbers for any ancillary software libraries or frameworks like PyTorch, TensorFlow, or scikit-learn. |
| Experiment Setup | Yes | All experiments were carried out using GPT-4-0613, employing 8-shot prompting and a decoding temperature of 1.0. The results are displayed in Table 5. We use 1 A100L GPU for this experiment. For each problem, 4 in-context examples are chosen based on skill-matching, and outputs are sampled with a decoding temperature of 0.2. We employ a CoT-based method with 4-shot prompting and greedy decoding, aligning with the baseline settings. (A decoding-settings sketch appears after the table.) |
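
The two prompting styles in the Research Type cell can be made concrete. Below is a minimal sketch, not the paper's actual prompts: the demonstration problem is the canonical GSM8K example, and the template wording is a hypothetical stand-in.

```python
# Minimal sketch contrasting text-based (chain-of-thought) and program-based
# (PAL) prompting. The exact prompt formats used in the paper are not
# reproduced here; these templates are illustrative only.

# Chain-of-thought: the demonstration shows step-by-step natural-language reasoning.
COT_TEMPLATE = """Q: Natalia sold clips to 48 of her friends in April, and then \
she sold half as many clips in May. How many clips did Natalia sell altogether?
A: In April she sold 48 clips. In May she sold 48 / 2 = 24 clips.
Altogether she sold 48 + 24 = 72 clips. The answer is 72.

Q: {question}
A:"""

# PAL: the demonstration is a Python program whose return value is the answer,
# so the final arithmetic is delegated to an interpreter rather than the LLM.
PAL_TEMPLATE = '''Q: Natalia sold clips to 48 of her friends in April, and then \
she sold half as many clips in May. How many clips did Natalia sell altogether?

# solution in Python:
def solution():
    clips_april = 48
    clips_may = clips_april / 2
    return clips_april + clips_may

Q: {question}

# solution in Python:
'''

if __name__ == "__main__":
    q = "A farmer plants 3 fields with 12 rows of 8 plants each. How many plants in total?"
    print(COT_TEMPLATE.format(question=q))
    print(PAL_TEMPLATE.format(question=q))
```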
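
The datasets in the Open Datasets cell are publicly available. A minimal loading sketch for GSM8K, assuming the Hugging Face `datasets` library and the public Hub identifier "gsm8k"; the paper does not state how the authors obtained the data, and Hub identifiers for the other benchmarks vary by mirror:

```python
# Minimal sketch: loading GSM8K via the Hugging Face `datasets` library.
# The identifier and field names below match the public "gsm8k" dataset;
# this is an illustration, not the paper's own data pipeline.
from datasets import load_dataset

gsm8k = load_dataset("gsm8k", "main")  # train/test splits as reported above
print(gsm8k["train"][0]["question"])   # grade-school word problem (text)
print(gsm8k["train"][0]["answer"])     # step-by-step solution ending in "#### <n>"
```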
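
The Experiment Setup cell mixes two decoding regimes: temperature 1.0 with 8-shot prompting in one stage, and temperature 0.2 (or greedy decoding) with 4 skill-matched in-context examples in another. A minimal sketch of issuing such a query with the OpenAI Python client (v1+); only the model name and temperature values come from the paper, while the `solve` helper and its prompt assembly are hypothetical:

```python
# Minimal sketch of the reported decoding settings using the OpenAI Python
# client (>= 1.0). Only "gpt-4-0613" and the temperatures are taken from the
# paper; the helper below is an illustrative assumption.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def solve(question: str, examples: list[str], temperature: float) -> str:
    """Few-shot query: `examples` are in-context demonstrations, e.g. the
    4 skill-matched exemplars or a fixed set of 8 CoT exemplars."""
    prompt = "\n\n".join(examples + [f"Q: {question}\nA:"])
    resp = client.chat.completions.create(
        model="gpt-4-0613",
        messages=[{"role": "user", "content": prompt}],
        temperature=temperature,  # 1.0, 0.2, or 0.0 (~greedy), per the table
    )
    return resp.choices[0].message.content
```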