Reprompting: Automated Chain-of-Thought Prompt Inference Through Gibbs Sampling

Authors: Weijia Xu, Andrzej Banburski, Nebojsa Jojic

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct extensive experiments on 20 challenging reasoning tasks. Results show that Reprompting outperforms human-written CoT prompts substantially by +9.4 points on average.
Researcher Affiliation | Industry | Microsoft Research, Redmond, USA. Correspondence to: Weijia Xu <weijiaxu@microsoft.com>.
Pseudocode | Yes | Algorithm 1: Reprompting algorithm
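The paper's Algorithm 1 infers chain-of-thought recipes by Gibbs sampling: each training example's recipe is iteratively resampled from the LLM conditioned on K recipes drawn from the other examples, and candidates that fail the training example can be rejected. The sketch below is a hypothetical simplification of that loop; the function names (`llm_sample`, `score`) and the acceptance rule are assumptions standing in for the paper's actual LLM calls and rejection step.

```python
import random

def reprompting(train_examples, llm_sample, score, K=5, iterations=100):
    """Gibbs-sampling sketch of Reprompting (simplified, hypothetical).

    train_examples: list of (question, answer) pairs.
    llm_sample(prompt): returns a candidate chain-of-thought recipe (string).
    score(recipe, example): truthy if the recipe solves the example.
    """
    # Initialize one recipe per training example via zero-shot sampling.
    recipes = {i: llm_sample(q) for i, (q, _) in enumerate(train_examples)}
    for _ in range(iterations):
        for i, (q, a) in enumerate(train_examples):
            # Condition on up to K recipes sampled from the other examples.
            others = [j for j in recipes if j != i]
            context = [recipes[j]
                       for j in random.sample(others, min(K, len(others)))]
            prompt = "\n".join(context) + "\n" + q
            candidate = llm_sample(prompt)
            # Accept the candidate only if it solves the training example
            # (a stand-in for the paper's probabilistic rejection step).
            if score(candidate, (q, a)):
                recipes[i] = candidate
    return list(recipes.values())
```

Over many iterations, recipes that transfer well across examples survive and propagate through the shared context, which is the intuition behind the Gibbs-style resampling.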
Open Source Code | No | The paper does not provide any specific link to its own source code repository or state that the code is publicly available.
Open Datasets | Yes | We evaluate Reprompting on 20 tasks from three reasoning benchmarks including Big-Bench Hard (BBH) (Suzgun et al., 2022), GSM8K (Cobbe et al., 2021) and MATH (Hendrycks et al., 2021).
Dataset Splits | No | The paper describes using 20 training examples and early stopping based on training accuracy, but it does not explicitly define or refer to a separate "validation" dataset split for hyperparameter tuning or model selection.
Hardware Specification | No | The paper states, "We use the OpenAI APIs for all our experiments," which means the experiments were run on OpenAI's infrastructure, but it does not provide specific details about the underlying hardware (e.g., GPU models, CPU types) used by OpenAI.
Software Dependencies | No | The paper mentions the specific LLMs used (gpt-3.5-turbo and text-davinci-003) but does not list any specific software dependencies with version numbers (e.g., programming languages, libraries, frameworks).
Experiment Setup | Yes | We set the number of examples in the prompt to K = 5. We run Reprompting for a maximum of M = 20,000 iterations. We allow for early stopping if the average training accuracy stops increasing for 1,000 iterations. For the rejection probability... we choose p_rej = 0.99. For both LLMs, we set the maximum number of output tokens to 500, top_p = 0.5, zero frequency and presence penalty. Additionally, we include END as the stop word. We set the temperature to 1.0 for Reprompting and 0.0 for testing.
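The decoding settings quoted above map directly onto standard OpenAI-style request parameters. The dictionaries below collect them for reference; the variable names are illustrative, and the actual API call wrapping is not shown in the paper.

```python
# Decoding parameters reported in the paper, expressed as OpenAI-style
# request fields. Variable names are hypothetical, chosen for clarity.
SAMPLING_PARAMS = dict(
    max_tokens=500,        # maximum number of output tokens
    top_p=0.5,             # nucleus sampling threshold
    frequency_penalty=0.0, # "zero frequency ... penalty"
    presence_penalty=0.0,  # "zero ... presence penalty"
    stop=["END"],          # END as the stop word
    temperature=1.0,       # temperature during Reprompting's sampling
)

# At test time only the temperature changes: 0.0 gives deterministic decoding.
TEST_PARAMS = {**SAMPLING_PARAMS, "temperature": 0.0}
```

Temperature 1.0 during inference keeps the Gibbs sampler exploring diverse recipes, while temperature 0.0 at test time makes the reported accuracies reproducible for a fixed prompt.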