Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Reprompting: Automated Chain-of-Thought Prompt Inference Through Gibbs Sampling
Authors: Weijia Xu, Andrzej Banburski, Nebojsa Jojic
ICML 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct extensive experiments on 20 challenging reasoning tasks. Results show that Reprompting outperforms human-written CoT prompts substantially by +9.4 points on average. |
| Researcher Affiliation | Industry | 1Microsoft Research, Redmond, USA. Correspondence to: Weijia Xu <EMAIL>. |
| Pseudocode | Yes | Algorithm 1: Reprompting algorithm |
| Open Source Code | No | The paper does not provide any specific link to its own source code repository or state that the code is publicly available. |
| Open Datasets | Yes | We evaluate Reprompting on 20 tasks from three reasoning benchmarks including Big-Bench Hard (BBH) (Suzgun et al., 2022), GSM8K (Cobbe et al., 2021) and MATH (Hendrycks et al., 2021) |
| Dataset Splits | No | The paper describes using 20 training examples and early stopping based on training accuracy, but it does not explicitly define or refer to a separate "validation" dataset split for hyperparameter tuning or model selection. |
| Hardware Specification | No | The paper states, "We use the OpenAI APIs for all our experiments," which means the experiments were run on OpenAI's infrastructure, but it does not provide specific details about the underlying hardware (e.g., GPU models, CPU types) used by OpenAI. |
| Software Dependencies | No | The paper mentions the specific LLMs used (gpt-3.5-turbo and text-davinci-003) but does not list any specific software dependencies with version numbers (e.g., programming languages, libraries, frameworks). |
| Experiment Setup | Yes | We set the number of examples in the prompt to K = 5. We run Reprompting for a maximum of M = 20,000 iterations. We allow for early stopping if the average training accuracy stops increasing for 1,000 iterations. For the rejection probability... we choose p_rej = 0.99. For both LLMs, we set the maximum number of output tokens to 500, top_p = 0.5, zero frequency and presence penalty. Additionally, we include END as the stop word. We set the temperature to 1.0 for Reprompting and 0.0 for testing. |
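The experiment-setup quote above can be tied together in a minimal sketch of a Reprompting-style Gibbs-sampling loop using the reported hyperparameters (K = 5, M = 20,000, p_rej = 0.99, 1,000-iteration early stopping). This is an illustration, not the authors' code: `call_llm` and `check_answer` are hypothetical stand-ins for the OpenAI API call (temperature 1.0, top_p 0.5, max 500 tokens, stop word END) and answer verification, and the pool-update details are simplified relative to Algorithm 1 in the paper.

```python
import random

def reprompting(train, call_llm, check_answer, K=5, M=20_000,
                p_rej=0.99, patience=1_000, seed=0):
    """Sketch of a Reprompting-style loop: iteratively resample
    chain-of-thought (CoT) recipes for training examples.

    train        : list of (question, gold_answer) pairs
    call_llm     : fn(prompt) -> generated CoT text (assumed interface)
    check_answer : fn(cot_text, gold_answer) -> bool (assumed interface)
    """
    rng = random.Random(seed)
    # Initialize one CoT recipe per training example (zero-shot sampling).
    recipes = {i: call_llm(q) for i, (q, _) in enumerate(train)}
    best_acc, stale = 0.0, 0

    for _ in range(M):
        i = rng.randrange(len(train))              # example to resample
        others = [j for j in recipes if j != i]
        ctx = rng.sample(others, min(K, len(others)))  # K in-context recipes
        prompt = "\n\n".join(f"{train[j][0]}\n{recipes[j]}" for j in ctx)
        new_cot = call_llm(prompt + "\n\n" + train[i][0])

        # Reject an incorrect new recipe with probability p_rej,
        # otherwise keep it in the pool.
        if check_answer(new_cot, train[i][1]) or rng.random() > p_rej:
            recipes[i] = new_cot

        # Early stopping on average training accuracy, as in the paper.
        acc = sum(check_answer(recipes[j], train[j][1])
                  for j in recipes) / len(train)
        if acc > best_acc:
            best_acc, stale = acc, 0
        else:
            stale += 1
            if stale >= patience:
                break
    return recipes, best_acc
```

In the paper the initialization and generation steps use gpt-3.5-turbo / text-davinci-003 via the OpenAI APIs; here any callable with the assumed `call_llm` signature can be plugged in.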