In-Context Principle Learning from Mistakes

Authors: Tianjun Zhang, Aman Madaan, Luyu Gao, Steven Zheng, Swaroop Mishra, Yiming Yang, Niket Tandon, Uri Alon

ICML 2024

| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate LEAP on a wide range of benchmarks, including multi-hop question answering (Hotpot QA), textual QA (DROP), Big-Bench Hard reasoning, and math problems (GSM8K and MATH); in all these benchmarks, LEAP improves the strongest available LLMs such as GPT-3.5-turbo, GPT-4, GPT-4-turbo and Claude-2.1. |
| Researcher Affiliation | Collaboration | UC Berkeley, Carnegie Mellon University, Google DeepMind, AI2. Correspondence to: Tianjun Zhang <tianjunz@berkeley.edu>, Aman Madaan <amadaan@cs.cmu.edu>, Uri Alon <urialon@google.com>. |
| Pseudocode | Yes | The complete algorithm is summarized in Algorithm 1. |
| Open Source Code | No | The paper does not provide an explicit statement or link to open-source code for the described methodology. |
| Open Datasets | Yes | We evaluated LEAP across various reasoning tasks, including Hotpot QA (Yang et al., 2018b), DROP (Dua et al., 2019a), MATH (Hendrycks et al., 2021), GSM8K (Cobbe et al., 2021), and Big-Bench Hard (Suzgun et al., 2022). |
| Dataset Splits | No | The paper reports the number of few-shot examples used for in-context learning (e.g., "3 examples for each", "6 examples"), but it does not specify conventional train/validation/test splits beyond these in-context examples. A "validation set" is mentioned only as a recommendation for real-life scenarios, not as part of the experimental setup. |
| Hardware Specification | No | The paper specifies the LLMs used (e.g., GPT-3.5-turbo, GPT-4, Claude-2.1, Gemini Pro), but does not detail the specific hardware (GPU, CPU models, etc.) on which these models were run or accessed for the experiments. |
| Software Dependencies | Yes | We evaluated LEAP across a wide range of base models, including GPT-3.5-turbo (version -0613), GPT-4 (version -0613), GPT-4-turbo (version -1106), Claude-2.1, and Gemini Pro (Gemini Team Google, 2023). |
| Experiment Setup | Yes | For each input xi, we sample n = 15 outputs with a non-zero temperature, producing a varied set of potential solutions... We repeated every run 3 times with a temperature of zero and report the average. |
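The experiment-setup row above can be sketched as code. This is not the authors' implementation: `generate` is a hypothetical stand-in for a real LLM API call (e.g. to GPT-3.5-turbo), and the sampling temperature of 0.7 is an assumption, since the paper only says "non-zero". Only the shape of the setup — n = 15 sampled candidates per input, then 3 repeated greedy runs for evaluation — comes from the source.

```python
import random


def generate(prompt, temperature, seed=None):
    """Hypothetical stand-in for an LLM call. At temperature 0 it is
    deterministic (greedy decoding); otherwise it returns varied outputs."""
    rng = random.Random(seed)
    if temperature == 0:
        return f"greedy answer to: {prompt}"
    return f"sampled answer {rng.randint(0, 10**6)} to: {prompt}"


def sample_candidate_solutions(x, n=15, temperature=0.7):
    """Mistake-generation step: for each input x, draw n = 15 outputs
    with a non-zero temperature, producing a varied set of solutions."""
    return [generate(x, temperature, seed=i) for i in range(n)]


def evaluate(x, runs=3):
    """Evaluation step: each run is repeated 3 times with temperature
    zero; the paper reports the average over these runs."""
    return [generate(x, temperature=0) for _ in range(runs)]
```

At temperature zero the three evaluation runs are identical here by construction; in practice API-served models are not always perfectly deterministic at temperature 0, which is presumably why the paper averages over repeats.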