Reprompting: Automated Chain-of-Thought Prompt Inference Through Gibbs Sampling
Authors: Weijia Xu, Andrzej Banburski, Nebojsa Jojic
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct extensive experiments on 20 challenging reasoning tasks. Results show that Reprompting outperforms human-written CoT prompts substantially by +9.4 points on average. |
| Researcher Affiliation | Industry | Microsoft Research, Redmond, USA. Correspondence to: Weijia Xu <weijiaxu@microsoft.com>. |
| Pseudocode | Yes | Algorithm 1: Reprompting algorithm (a sketch of this sampling loop appears after the table) |
| Open Source Code | No | The paper does not provide any specific link to its own source code repository or state that the code is publicly available. |
| Open Datasets | Yes | We evaluate Reprompting on 20 tasks from three reasoning benchmarks including Big-Bench Hard (BBH) (Suzgun et al., 2022), GSM8K (Cobbe et al., 2021) and MATH (Hendrycks et al., 2021) |
| Dataset Splits | No | The paper describes using 20 training examples and early stopping based on training accuracy, but it does not explicitly define or refer to a separate “validation” dataset split for hyperparameter tuning or model selection. |
| Hardware Specification | No | The paper states, “We use the OpenAI APIs for all our experiments,” which means the experiments were run on OpenAI's infrastructure, but it does not provide specific details about the underlying hardware (e.g., GPU models, CPU types) used by OpenAI. |
| Software Dependencies | No | The paper mentions the specific LLMs used (gpt-3.5-turbo and text-davinci-003) but does not list any specific software dependencies with version numbers (e.g., programming languages, libraries, frameworks). |
| Experiment Setup | Yes | We set the number of examples in the prompt to K = 5. We run Reprompting for a maximum of M = 20,000 iterations. We allow for early stopping if the average training accuracy stops increasing for 1,000 iterations. For the rejection probability... we choose p_rej = 0.99. For both LLMs, we set the maximum number of output tokens to 500, top_p = 0.5, and zero frequency and presence penalties. Additionally, we include END as the stop word. We set the temperature to 1.0 for Reprompting and 0.0 for testing. (These decoding settings are sketched as an example API call after the table.) |
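
For context on the Pseudocode row, the following is a minimal Python sketch of the Gibbs-sampling loop that Algorithm 1 describes, using the hyperparameters quoted in the Experiment Setup row. The `llm_generate` and `extract_answer` helpers are hypothetical stand-ins rather than the authors' code, and the prompt format and early-stopping bookkeeping are simplified assumptions.

```python
import random

# Hyperparameters quoted from the paper's Experiment Setup row.
K = 5             # number of demonstration examples per prompt
M = 20_000        # maximum number of Gibbs-sampling iterations
P_REJ = 0.99      # probability of rejecting a recipe whose answer is wrong
PATIENCE = 1_000  # early-stopping window on training accuracy

def llm_generate(prompt: str) -> str:
    """Hypothetical stand-in for an LLM call (e.g., a chat model sampled with
    temperature=1.0, top_p=0.5, max_tokens=500, stop='END'). Replace with a
    real API call; here it returns a placeholder recipe."""
    return "Let's think step by step... The answer is 42. END"

def extract_answer(recipe: str) -> str:
    """Hypothetical answer extraction: parse the final answer out of the
    generated chain-of-thought text."""
    return recipe.rsplit("The answer is", 1)[-1].strip(" .ENDend\n")

def reprompting(train, m_iters=M):
    """Gibbs-sampling sketch: `train` is a list of (question, gold_answer)
    pairs. Each example keeps a 'recipe' (CoT solution) that is resampled
    conditioned on K recipes drawn from the other training examples."""
    # Initialize recipes zero-shot (no demonstrations in the prompt).
    recipes = [llm_generate(q) for q, _ in train]
    best_acc, stale = 0.0, 0

    for _ in range(m_iters):
        j = random.randrange(len(train))  # example whose recipe is resampled
        others = [i for i in range(len(train)) if i != j]
        demo_ids = random.sample(others, k=min(K, len(others)))
        prompt = "".join(
            f"Q: {train[i][0]}\nA: {recipes[i]}\n\n" for i in demo_ids
        ) + f"Q: {train[j][0]}\nA:"
        new_recipe = llm_generate(prompt)

        # Accept recipes that reach the gold answer; reject wrong ones
        # with probability P_REJ (i.e., keep them with probability 1 - P_REJ).
        correct = extract_answer(new_recipe) == train[j][1]
        if correct or random.random() > P_REJ:
            recipes[j] = new_recipe

        # Early stopping when training accuracy stops improving.
        acc = sum(
            extract_answer(r) == y for r, (_, y) in zip(recipes, train)
        ) / len(train)
        if acc > best_acc:
            best_acc, stale = acc, 0
        else:
            stale += 1
            if stale >= PATIENCE:
                break
    return recipes
```
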
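The decoding settings from the Experiment Setup row can be mapped onto an OpenAI chat-completions call as in the sketch below. The paper does not show its client code, so this is only an illustrative assumption: the `sample_recipe` name is hypothetical, `gpt-3.5-turbo` is one of the two models the paper names, and the completions-endpoint variant used for text-davinci-003 is omitted.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def sample_recipe(prompt: str, reprompting_phase: bool = True) -> str:
    """Issue one generation with the decoding parameters quoted in the paper:
    max 500 output tokens, top_p = 0.5, zero frequency/presence penalties,
    'END' as the stop word, temperature 1.0 during Reprompting and 0.0 at
    test time."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=500,
        top_p=0.5,
        temperature=1.0 if reprompting_phase else 0.0,
        frequency_penalty=0.0,
        presence_penalty=0.0,
        stop=["END"],
    )
    return response.choices[0].message.content
```
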