In-Context Principle Learning from Mistakes
Authors: Tianjun Zhang, Aman Madaan, Luyu Gao, Steven Zheng, Swaroop Mishra, Yiming Yang, Niket Tandon, Uri Alon
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate LEAP on a wide range of benchmarks, including multi-hop question answering (Hotpot QA), textual QA (DROP), Big-Bench Hard reasoning, and math problems (GSM8K and MATH); in all these benchmarks, LEAP improves the strongest available LLMs such as GPT-3.5-turbo, GPT-4, GPT-4-turbo and Claude-2.1. |
| Researcher Affiliation | Collaboration | 1 UC Berkeley, 2 Carnegie Mellon University, 3 Google DeepMind, 4 AI2. Correspondence to: Tianjun Zhang <tianjunz@berkeley.edu>, Aman Madaan <amadaan@cs.cmu.edu>, Uri Alon <urialon@google.com>. |
| Pseudocode | Yes | The complete algorithm is summarized in Algorithm 1. |
| Open Source Code | No | The paper does not provide an explicit statement or link to open-source code for the described methodology. |
| Open Datasets | Yes | We evaluated LEAP across various reasoning tasks, including Hotpot QA (Yang et al., 2018b), DROP (Dua et al., 2019a), MATH (Hendrycks et al., 2021), GSM8K (Cobbe et al., 2021), and Big-Bench Hard (Suzgun et al., 2022). |
| Dataset Splits | No | The paper describes the number of few-shot examples used for in-context learning (e.g., "3 examples for each", "6 examples"), but it does not specify traditional dataset splits for training/validation/testing beyond these few-shot examples in the prompt. While a "validation set" is mentioned in a recommendation for real-life scenarios, it's not described as part of their experimental setup's splits. |
| Hardware Specification | No | The paper specifies the LLMs used (e.g., GPT-3.5-turbo, GPT-4, Claude-2.1, Gemini Pro), but does not detail the specific hardware (GPU, CPU models, etc.) on which these models were run or accessed for the experiments. |
| Software Dependencies | Yes | We evaluated LEAP across a wide range of base models, including GPT-3.5-turbo (version -0613), GPT-4 (version -0613), GPT-4-turbo (version -1106), Claude-2.1, and Gemini Pro (Gemini Team Google, 2023). |
| Experiment Setup | Yes | For each input xi, we sample n = 15 outputs with a non-zero temperature, producing a varied set of potential solutions... We repeated every run 3 times with a temperature of zero and report the average. |
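The mistake-collection step quoted above (sample n = 15 outputs per input at a non-zero temperature, then keep the incorrect ones) can be sketched as follows. This is a minimal illustration, not the paper's implementation: `sample_llm` is a hypothetical callable standing in for whatever model API is used, and the exact-match comparison is a simplifying assumption (the paper's benchmarks use task-specific answer extraction and scoring).

```python
from typing import Callable

def collect_mistakes(
    examples: list[tuple[str, str]],          # (question, ground-truth answer) pairs
    sample_llm: Callable[[str, float], str],  # hypothetical sampler: (prompt, temperature) -> answer
    n: int = 15,                              # samples per input, matching the paper's setting
    temperature: float = 0.7,                 # any non-zero temperature; the value here is illustrative
) -> list[tuple[str, str, str]]:
    """Sample n candidate solutions per example and keep the incorrect ones.

    The returned (question, wrong_answer, gold) triples would then be fed to
    an LLM to articulate principles, which LEAP prepends to the final
    zero-temperature evaluation prompt.
    """
    mistakes = []
    for question, gold in examples:
        for _ in range(n):
            answer = sample_llm(question, temperature)
            if answer.strip() != gold.strip():  # simplified exact-match check
                mistakes.append((question, answer, gold))
    return mistakes
```

A usage sketch with a stubbed model: `collect_mistakes([("2+2", "4")], my_sampler, n=15)` returns only the samples that disagree with the gold answer, which is the raw material for the principle-generation step summarized in the paper's Algorithm 1.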