Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Reprompting: Automated Chain-of-Thought Prompt Inference Through Gibbs Sampling
Authors: Weijia Xu, Andrzej Banburski, Nebojsa Jojic
ICML 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct extensive experiments on 20 challenging reasoning tasks. Results show that Reprompting outperforms human-written CoT prompts substantially by +9.4 points on average. |
| Researcher Affiliation | Industry | 1Microsoft Research, Redmond, USA. Correspondence to: Weijia Xu <EMAIL>. |
| Pseudocode | Yes | Algorithm 1: Reprompting algorithm |
| Open Source Code | No | The paper does not provide any specific link to its own source code repository or state that the code is publicly available. |
| Open Datasets | Yes | We evaluate Reprompting on 20 tasks from three reasoning benchmarks including Big-Bench Hard (BBH) (Suzgun et al., 2022), GSM8K (Cobbe et al., 2021) and MATH (Hendrycks et al., 2021) |
| Dataset Splits | No | The paper describes using 20 training examples and early stopping based on training accuracy, but it does not explicitly define or refer to a separate "validation" dataset split for hyperparameter tuning or model selection. |
| Hardware Specification | No | The paper states, "We use the OpenAI APIs for all our experiments," which means the experiments were run on OpenAI's infrastructure, but it does not provide specific details about the underlying hardware (e.g., GPU models, CPU types) used by OpenAI. |
| Software Dependencies | No | The paper mentions the specific LLMs used (gpt-3.5-turbo and text-davinci-003) but does not list any specific software dependencies with version numbers (e.g., programming languages, libraries, frameworks). |
| Experiment Setup | Yes | We set the number of examples in the prompt to K = 5. We run Reprompting for a maximum of M = 20,000 iterations. We allow for early stopping if the average training accuracy stops increasing for 1,000 iterations. For the rejection probability... we choose p_rej = 0.99. For both LLMs, we set the maximum number of output tokens to 500, top_p = 0.5, zero frequency and presence penalty. Additionally, we include END as the stop word. We set the temperature to 1.0 for Reprompting and 0.0 for testing. |
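The experiment-setup quote above can be tied together in a minimal sketch of a Reprompting-style Gibbs-sampling loop using the reported hyperparameters (K = 5, M = 20,000, p_rej = 0.99, 1,000-iteration early stopping). This is an illustration, not the authors' code: `call_llm` and `check_answer` are hypothetical stand-ins for the OpenAI API call (temperature 1.0, top_p 0.5, max 500 tokens, stop word END) and answer verification, and the pool-update details are simplified relative to Algorithm 1 in the paper.

```python
import random

def reprompting(train, call_llm, check_answer, K=5, M=20_000,
                p_rej=0.99, patience=1_000, seed=0):
    """Sketch of a Reprompting-style loop: iteratively resample
    chain-of-thought (CoT) recipes for training examples.

    train        : list of (question, gold_answer) pairs
    call_llm     : fn(prompt) -> generated CoT text (assumed interface)
    check_answer : fn(cot_text, gold_answer) -> bool (assumed interface)
    """
    rng = random.Random(seed)
    # Initialize one CoT recipe per training example (zero-shot sampling).
    recipes = {i: call_llm(q) for i, (q, _) in enumerate(train)}
    best_acc, stale = 0.0, 0

    for _ in range(M):
        i = rng.randrange(len(train))              # example to resample
        others = [j for j in recipes if j != i]
        ctx = rng.sample(others, min(K, len(others)))  # K in-context recipes
        prompt = "\n\n".join(f"{train[j][0]}\n{recipes[j]}" for j in ctx)
        new_cot = call_llm(prompt + "\n\n" + train[i][0])

        # Reject an incorrect new recipe with probability p_rej,
        # otherwise keep it in the pool.
        if check_answer(new_cot, train[i][1]) or rng.random() > p_rej:
            recipes[i] = new_cot

        # Early stopping on average training accuracy, as in the paper.
        acc = sum(check_answer(recipes[j], train[j][1])
                  for j in recipes) / len(train)
        if acc > best_acc:
            best_acc, stale = acc, 0
        else:
            stale += 1
            if stale >= patience:
                break
    return recipes, best_acc
```

In the paper the initialization and generation steps use gpt-3.5-turbo / text-davinci-003 via the OpenAI APIs; here any callable with the assumed `call_llm` signature can be plugged in.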