Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
In-Context Principle Learning from Mistakes
Authors: Tianjun Zhang, Aman Madaan, Luyu Gao, Steven Zheng, Swaroop Mishra, Yiming Yang, Niket Tandon, Uri Alon
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate LEAP on a wide range of benchmarks, including multi-hop question answering (Hotpot QA), textual QA (DROP), Big-Bench Hard reasoning, and math problems (GSM8K and MATH); in all these benchmarks, LEAP improves the strongest available LLMs such as GPT-3.5-turbo, GPT-4, GPT-4-turbo and Claude-2.1. |
| Researcher Affiliation | Collaboration | ¹UC Berkeley, ²Carnegie Mellon University, ³Google DeepMind, ⁴AI2. Correspondence to: Tianjun Zhang <EMAIL>, Aman Madaan <EMAIL>, Uri Alon <EMAIL>. |
| Pseudocode | Yes | The complete algorithm is summarized in Algorithm 1. |
| Open Source Code | No | The paper does not provide an explicit statement or link to open-source code for the described methodology. |
| Open Datasets | Yes | We evaluated LEAP across various reasoning tasks, including Hotpot QA (Yang et al., 2018b), DROP (Dua et al., 2019a), MATH (Hendrycks et al., 2021), GSM8K (Cobbe et al., 2021), and Big-Bench Hard (Suzgun et al., 2022). |
| Dataset Splits | No | The paper describes the number of few-shot examples used for in-context learning (e.g., "3 examples for each", "6 examples"), but it does not specify traditional dataset splits for training/validation/testing beyond these few-shot examples in the prompt. While a "validation set" is mentioned in a recommendation for real-life scenarios, it's not described as part of their experimental setup's splits. |
| Hardware Specification | No | The paper specifies the LLMs used (e.g., GPT-3.5-turbo, GPT-4, Claude-2.1, Gemini Pro), but does not detail the specific hardware (GPU, CPU models, etc.) on which these models were run or accessed for the experiments. |
| Software Dependencies | Yes | We evaluated LEAP across a wide range of base models, including GPT-3.5-turbo (version -0613), GPT-4 (version -0613), GPT-4-turbo (version -1106), Claude-2.1, and Gemini Pro (Gemini Team Google, 2023). |
| Experiment Setup | Yes | For each input $x_i$, we sample $n = 15$ outputs with a non-zero temperature, producing a varied set of potential solutions... We repeated every run 3 times with a temperature of zero and report the average. |
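
The Experiment Setup row mentions two sampling regimes: 15 outputs per input at a non-zero temperature, and evaluation runs repeated 3 times at temperature zero with the average reported. The following is a minimal sketch of that protocol, not the authors' code. The `generate` callable, the temperature value of 0.7, and the exact-match scoring are all assumptions made for illustration; only the counts 15 and 3 and the zero evaluation temperature come from the quoted text.

```python
from statistics import mean
from typing import Callable, List, Tuple

# Hypothetical signature: (prompt, temperature) -> model output text.
Generate = Callable[[str, float], str]


def sample_candidates(generate: Generate, prompt: str, n: int = 15,
                      temperature: float = 0.7) -> List[str]:
    """Draw n outputs for one input at a non-zero temperature.

    n = 15 comes from the quoted setup; 0.7 is an assumed placeholder.
    """
    if temperature <= 0:
        raise ValueError("sampling needs a non-zero temperature")
    return [generate(prompt, temperature) for _ in range(n)]


def accuracy(generate: Generate, examples: List[Tuple[str, str]]) -> float:
    """Exact-match accuracy (an assumed metric) of one run at temperature zero."""
    hits = sum(generate(q, 0.0).strip() == a for q, a in examples)
    return hits / len(examples)


def mean_accuracy(generate: Generate, examples: List[Tuple[str, str]],
                  runs: int = 3) -> float:
    """Repeat the evaluation run and report the average accuracy."""
    return mean(accuracy(generate, examples) for _ in range(runs))
```

Passing the model as a plain callable keeps the sketch independent of any particular provider API; a real client wrapper would be substituted for `generate`.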