Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

In-Context Principle Learning from Mistakes

Authors: Tianjun Zhang, Aman Madaan, Luyu Gao, Steven Zheng, Swaroop Mishra, Yiming Yang, Niket Tandon, Uri Alon

ICML 2024 | Venue PDF | LLM Run Details

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | We evaluate LEAP on a wide range of benchmarks, including multi-hop question answering (Hotpot QA), textual QA (DROP), Big-Bench Hard reasoning, and math problems (GSM8K and MATH); in all these benchmarks, LEAP improves the strongest available LLMs such as GPT-3.5-turbo, GPT-4, GPT-4-turbo and Claude-2.1. |
| Researcher Affiliation | Collaboration | ¹UC Berkeley ²Carnegie Mellon University ³Google DeepMind ⁴AI2. Correspondence to: Tianjun Zhang <EMAIL>, Aman Madaan <EMAIL>, Uri Alon <EMAIL>. |
| Pseudocode | Yes | The complete algorithm is summarized in Algorithm 1. |
| Open Source Code | No | The paper does not provide an explicit statement or link to open-source code for the described methodology. |
| Open Datasets | Yes | We evaluated LEAP across various reasoning tasks, including Hotpot QA (Yang et al., 2018b), DROP (Dua et al., 2019a), MATH (Hendrycks et al., 2021), GSM8K (Cobbe et al., 2021), and Big-Bench Hard (Suzgun et al., 2022). |
| Dataset Splits | No | The paper describes the number of few-shot examples used for in-context learning (e.g., "3 examples for each", "6 examples"), but it does not specify traditional train/validation/test splits beyond these few-shot examples in the prompt. A "validation set" is mentioned only as a recommendation for real-life scenarios, not as part of the experimental setup. |
| Hardware Specification | No | The paper specifies the LLMs used (e.g., GPT-3.5-turbo, GPT-4, Claude-2.1, Gemini Pro), but does not detail the specific hardware (GPU or CPU models, etc.) on which these models were run or accessed for the experiments. |
| Software Dependencies | Yes | We evaluated LEAP across a wide range of base models, including GPT-3.5-turbo (version -0613), GPT-4 (version -0613), GPT-4-turbo (version -1106), Claude-2.1, and Gemini Pro (Gemini Team Google, 2023). |
| Experiment Setup | Yes | For each input x_i, we sample n = 15 outputs with a non-zero temperature, producing a varied set of potential solutions... We repeated every run 3 times with a temperature of zero and report the average. |
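The quoted setup (n = 15 sampled outputs at non-zero temperature for collecting candidate solutions, then deterministic temperature-0 runs for evaluation) can be sketched roughly as below. Note that `generate` is a hypothetical stand-in for the actual LLM API call (the paper used models such as GPT-3.5-turbo and GPT-4); only the sampling structure mirrors the paper.

```python
import random

def generate(prompt: str, temperature: float, rng: random.Random) -> str:
    """Hypothetical stand-in for an LLM call; purely illustrative.

    Temperature 0 models greedy (deterministic) decoding; a non-zero
    temperature models stochastic sampling of varied outputs.
    """
    if temperature == 0:
        return f"greedy answer to: {prompt}"
    return f"sampled answer #{rng.randint(0, 10**6)} to: {prompt}"

def sample_solutions(prompt: str, n: int = 15, temperature: float = 0.7) -> list[str]:
    """Draw n varied candidate solutions, mirroring the mistake-collection step."""
    rng = random.Random(0)  # fixed seed so this sketch is reproducible
    return [generate(prompt, temperature, rng) for _ in range(n)]

# Mistake-collection phase: 15 varied samples per input.
candidates = sample_solutions("Solve: 17 * 24 = ?")

# Evaluation phase: repeated temperature-0 runs are identical, so the
# paper's "average over 3 runs" controls for API-side nondeterminism.
evaluations = [generate("Solve: 17 * 24 = ?", 0.0, random.Random()) for _ in range(3)]
```

The concrete temperature value (0.7) is an assumption for illustration; the paper only states that sampling uses a non-zero temperature.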