Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

In-Context Principle Learning from Mistakes

Authors: Tianjun Zhang, Aman Madaan, Luyu Gao, Steven Zheng, Swaroop Mishra, Yiming Yang, Niket Tandon, Uri Alon

ICML 2024 | Venue PDF | LLM Run Details

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | We evaluate LEAP on a wide range of benchmarks, including multi-hop question answering (Hotpot QA), textual QA (DROP), Big-Bench Hard reasoning, and math problems (GSM8K and MATH); in all these benchmarks, LEAP improves the strongest available LLMs such as GPT-3.5-turbo, GPT-4, GPT-4-turbo and Claude-2.1. |
| Researcher Affiliation | Collaboration | ¹UC Berkeley ²Carnegie Mellon University ³Google DeepMind ⁴AI2. Correspondence to: Tianjun Zhang <EMAIL>, Aman Madaan <EMAIL>, Uri Alon <EMAIL>. |
| Pseudocode | Yes | The complete algorithm is summarized in Algorithm 1. |
| Open Source Code | No | The paper does not provide an explicit statement or link to open-source code for the described methodology. |
| Open Datasets | Yes | We evaluated LEAP across various reasoning tasks, including Hotpot QA (Yang et al., 2018b), DROP (Dua et al., 2019a), MATH (Hendrycks et al., 2021), GSM8K (Cobbe et al., 2021), and Big-Bench Hard (Suzgun et al., 2022). |
| Dataset Splits | No | The paper describes the number of few-shot examples used for in-context learning (e.g., "3 examples for each", "6 examples"), but it does not specify traditional train/validation/test splits beyond these few-shot examples in the prompt. A "validation set" is mentioned only as a recommendation for real-life scenarios, not as part of the experimental setup. |
| Hardware Specification | No | The paper specifies the LLMs used (e.g., GPT-3.5-turbo, GPT-4, Claude-2.1, Gemini Pro), but does not detail the specific hardware (GPU or CPU models, etc.) on which these models were run or accessed for the experiments. |
| Software Dependencies | Yes | We evaluated LEAP across a wide range of base models, including GPT-3.5-turbo (version -0613), GPT-4 (version -0613), GPT-4-turbo (version -1106), Claude-2.1, and Gemini Pro (Gemini Team Google, 2023). |
| Experiment Setup | Yes | For each input x_i, we sample n = 15 outputs with a non-zero temperature, producing a varied set of potential solutions... We repeated every run 3 times with a temperature of zero and report the average. |
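The quoted setup (n = 15 sampled outputs at non-zero temperature for collecting candidate solutions, then deterministic temperature-0 runs for evaluation) can be sketched roughly as below. Note that `generate` is a hypothetical stand-in for the actual LLM API call (the paper used models such as GPT-3.5-turbo and GPT-4); only the sampling structure mirrors the paper.

```python
import random

def generate(prompt: str, temperature: float, rng: random.Random) -> str:
    """Hypothetical stand-in for an LLM call; purely illustrative.

    Temperature 0 models greedy (deterministic) decoding; a non-zero
    temperature models stochastic sampling of varied outputs.
    """
    if temperature == 0:
        return f"greedy answer to: {prompt}"
    return f"sampled answer #{rng.randint(0, 10**6)} to: {prompt}"

def sample_solutions(prompt: str, n: int = 15, temperature: float = 0.7) -> list[str]:
    """Draw n varied candidate solutions, mirroring the mistake-collection step."""
    rng = random.Random(0)  # fixed seed so this sketch is reproducible
    return [generate(prompt, temperature, rng) for _ in range(n)]

# Mistake-collection phase: 15 varied samples per input.
candidates = sample_solutions("Solve: 17 * 24 = ?")

# Evaluation phase: repeated temperature-0 runs are identical, so the
# paper's "average over 3 runs" controls for API-side nondeterminism.
evaluations = [generate("Solve: 17 * 24 = ?", 0.0, random.Random()) for _ in range(3)]
```

The concrete temperature value (0.7) is an assumption for illustration; the paper only states that sampling uses a non-zero temperature.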