Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Eliciting Reasoning in Language Models with Cognitive Tools

Authors: Brown Wilfried Ebouky Doualla Dina, Andrea Bartezzaghi, Mattia Rigotti

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	4 Experiments. Datasets: Following established evaluation practices in the reasoning literature [Hendrycks et al., 2021], in this work we investigate the elicitation of reasoning using cognitive tools on math-oriented benchmarks. We focus our experiments on math benchmarks because of how reasoning is central in solving math problems. Specifically, we consider the following datasets: AIME 2024 [MAA, 2024] is a dataset that contains 30 samples... MATH 500 [Hendrycks et al., 2021] contains 500 math problems... AMC [Li et al., 2024] is a curated collection of 83 problems... Smolagents Benchmark-v1 [Huggingface, 2024] is composed of questions... Models: We use the open-weight models Qwen2.5-(7B, 32B) Instruct [Qwen Team, 2024], Llama3.1-8B Instruct, and Llama3.3-70B Instruct [AI@Meta, 2024]. We also experiment with closed models GPT-4.1 and o1-preview. Evaluation and Baselines: In all experiments, we report the model's accuracy in providing the correct answer on the first try (pass@1). For AIME 2024 [MAA, 2024] and AMC [Li et al., 2024], the answer from the model is compared to the ground truth via parsing. Regarding MATH500 [Hendrycks et al., 2021], which includes more elaborated answers that are not just numerical (e.g., complex expressions), we use an LLM-as-a-judge approach to establish the veracity of the answers [Zheng et al., 2023]. Specifically, we use GPT-4.1 as a judge and report the accuracy of the model in answering the questions (see the prompt used for the judge LLM in the Appendix).
Researcher Affiliation	Collaboration	Brown Ebouky: IBM Research Zurich, ETH Zurich, EMAIL; Bartezzaghi: IBM Research Zurich, EMAIL; Rigotti: IBM Research Zurich, EMAIL
Pseudocode	Yes	We provide pseudo-code of our cognitive tools pipeline in the Appendix. A.4 Pseudo-code of cognitive tools pipeline: We report in Algorithm 1 pseudo-code illustrating how tools interact with the main LLM loop in our cognitive tools pipeline. Algorithm 1 LLM-Orchestrated Reasoning with Cognitive Tools 1: Initialize context {question: question, history: [ ]} 2: while True do 3: response ← LLM(prompt = "Cognitive Tools Prompt", context) 4: if response["action"] = "answer" then 5: return response["answer"] 6: else if response["action"] = "call_tool" then 7: tool_input ← response["tool_input"] 8: tool_name ← response["tool_name"] 9: tool_output ← LLM(prompt = "Tool Prompt", inputs = tool_input) 10: context["history"].append({tool_call : tool_input, tool_output : tool_output}) 11: end if 12: end while
Open Source Code	No	Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)? Answer: [Yes] Justification: Yes, everything needed to reproduce results is disclosed: empirical validations are carried out on publicly available datasets, and we plan to release code to reproduce results upon acceptance. Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [Yes] Justification: The datasets that were used are publicly available, but code will be released upon acceptance.
Open Datasets	Yes	Specifically, we consider the following datasets: AIME 2024 [MAA, 2024] is a dataset that contains 30 samples... MATH 500 [Hendrycks et al., 2021] contains 500 math problems... AMC [Li et al., 2024] is a curated collection of 83 problems... Smolagents Benchmark-v1 [Huggingface, 2024] is composed of questions about different tasks, such as math or question answering from Hugging Face.
Dataset Splits	No	The paper does not provide specific train/test/validation dataset splits. It mentions the total number of samples for each dataset (e.g., AIME 2024 contains 30 samples, MATH 500 contains 500 math problems, AMC is 83 problems, Smolbenchmark is 50 samples) and evaluates 'pass@1' accuracy, but does not detail how these datasets are partitioned into training, validation, or test sets for reproducibility of the splitting.
Hardware Specification	No	The paper does not provide specific hardware details such as GPU models, CPU types, or memory specifications used for running the experiments. Although the NeurIPS checklist indicates 'Yes, we describe our hardware setup in the text', no such description is found within the paper's content.
Software Dependencies	No	The paper does not list specific software dependencies with version numbers (e.g., Python 3.8, PyTorch 1.9). It mentions the LLM models used (Qwen2.5, Llama3.1, GPT-4.1) and implicitly Python for the code tool, but lacks explicit version details for development environments or libraries.
Experiment Setup	Yes	A.1 Model Inference Hyper-parameters: We use the default hyper-parameters provided in the model configurations for all the models that we considered in our experiments. For instance, for Qwen2.5-(7B,32B)-Instruct we use a temperature of 0.7, top-p of 0.8 and top-k of 20. As for Llama-3.1-(8, 70)B-Instruct, the temperature is 0.6 and top-p is 0.9. A.2 Baseline: We establish our baseline on Qwen2.5-(7B, 32B) Instruct, Llama3.1-8B Instruct, Llama3.3-70B Instruct and GPT-4.1 models by prompting the LLM with the question we want to have an answer for. We only append the sentence: "Solve the math problem: " to each question and we do not change the system prompt of the model. A.3 Cognitive Prompting: For the cognitive prompting strategy, we use the prompt released in Kramer and Baumann [2024], which is as follows: Cognitive Prompting (prompt) Solve the following math problem by following each step of cognitive operations from the list below. A.5 Cognitive Tool Prompts: As explained in the main text, the cognitive tools that we introduce are implemented in a modular fashion. Each cognitive tool is implemented as a call to an LLM (same as the original one) but with a specific prompt tailored to the specifics of the tool. Below we present the prompt used for each cognitive tool:
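
Illustrative sketch (not from the paper): the loop quoted in the Pseudocode row (Algorithm 1) can be written in Python roughly as follows. This is a minimal sketch under stated assumptions, not the authors' implementation. The callables `call_llm` and `call_tool_llm` are hypothetical stand-ins for the two model calls, the tool name `example_tool` is invented for illustration, and the `max_steps` cap is an addition made here so the loop always terminates; the quoted pseudo-code loops until an answer is returned. A scripted stub replaces the model so the example runs offline.

```python
from typing import Any, Callable, Dict

Context = Dict[str, Any]


def run_cognitive_tools(
    question: str,
    call_llm: Callable[[Context], Dict[str, Any]],
    call_tool_llm: Callable[[str, Any], str],
    max_steps: int = 10,  # assumption: safety cap, not part of Algorithm 1
) -> Any:
    """Orchestrate tool calls until the main LLM returns an answer."""
    # Step 1 of Algorithm 1: initialize the context.
    context: Context = {"question": question, "history": []}
    for _ in range(max_steps):
        # Step 3: query the main LLM with the current context.
        response = call_llm(context)
        if response["action"] == "answer":
            # Steps 4-5: the model decided to answer.
            return response["answer"]
        if response["action"] == "call_tool":
            # Steps 7-9: run the requested tool as a separate LLM call.
            tool_input = response["tool_input"]
            tool_name = response["tool_name"]
            tool_output = call_tool_llm(tool_name, tool_input)
            # Step 10: record the exchange so the next turn can see it.
            context["history"].append(
                {"tool_call": tool_input, "tool_output": tool_output}
            )
    raise RuntimeError("no answer produced within max_steps")


def scripted_llm(context: Context) -> Dict[str, Any]:
    """Hypothetical stub: call one tool first, then answer."""
    if not context["history"]:
        return {
            "action": "call_tool",
            "tool_name": "example_tool",  # hypothetical tool name
            "tool_input": context["question"],
        }
    return {"action": "answer", "answer": "4"}


def echo_tool(tool_name: str, tool_input: Any) -> str:
    """Hypothetical stub for the tool-specific LLM call."""
    return f"[{tool_name}] {tool_input}"


result = run_cognitive_tools("What is 2 + 2?", scripted_llm, echo_tool)
print(result)
```

In a real setup the two stubs would be replaced by calls to the same underlying model with different prompts, as the Experiment Setup row describes for the cognitive tools.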