CRITIC: Large Language Models Can Self-Correct with Tool-Interactive Critiquing

Authors: Zhibin Gou, Zhihong Shao, Yeyun Gong, Yelong Shen, Yujiu Yang, Nan Duan, Weizhu Chen

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Comprehensive evaluations involving free-form question answering, mathematical program synthesis, and toxicity reduction demonstrate that CRITIC consistently enhances the performance of LLMs. Meanwhile, our research highlights the crucial importance of external feedback in promoting the ongoing self-improvement of LLMs. We evaluate our approach on a range of LLMs, including ChatGPT, Text-Davinci-003, and open-source LLaMA-2 variants (7B, 13B, and 70B), spanning three distinct tasks: free-form question answering, mathematical program synthesis, and toxicity reduction. Our findings demonstrate that CRITIC consistently surpasses prior techniques, obviating the need for supplementary data or training.
Researcher Affiliation | Collaboration | Zhibin Gou (1,2), Zhihong Shao (1,2), Yeyun Gong (2), Yelong Shen (3), Yujiu Yang (1), Nan Duan (2), Weizhu Chen (3). Affiliations: 1 Tsinghua University, 2 Microsoft Research Asia, 3 Microsoft Azure AI.
Pseudocode | Yes | Algorithm 1: the CRITIC algorithm
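For reference, a minimal Python sketch of the verify-then-correct loop that Algorithm 1 describes, assuming hypothetical helpers `generate`, `critique_with_tools`, and `correct` that stand in for the paper's few-shot prompted LLM calls and external tool queries; this is a sketch of the described control flow, not the released implementation.

```python
# Sketch of the CRITIC verify-then-correct loop (Algorithm 1).
# `generate`, `critique_with_tools`, and `correct` are hypothetical stand-ins for
# the few-shot prompted LLM calls and external tool interactions used in the paper.

def critic_loop(question, generate, critique_with_tools, correct, max_rounds=3):
    answers = [generate(question)]                              # initial output (e.g., via CoT prompting)
    for _ in range(max_rounds):
        feedback = critique_with_tools(question, answers[-1])   # tool-grounded critique of the output
        revised = correct(question, answers[-1], feedback)      # revise the output using the critique
        answers.append(revised)
        if answers[-1] == answers[-2]:                          # stop early if the answer no longer changes
            break
    return answers[-1]
```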
Open Source Code | Yes | Code released at https://github.com/microsoft/ProphetNet/tree/master/CRITIC. Our web tools released at https://github.com/ZubinGou/llm-agent-web-tools.
Open Datasets | Yes | AmbigNQ (Min et al., 2020), an enhanced version of Natural Questions (Kwiatkowski et al., 2019) that employs multi-reference annotations to resolve ambiguity, along with TriviaQA (Joshi et al., 2017) and HotpotQA (Yang et al., 2018). We adopt diverse arithmetic reasoning datasets including GSM8k (Cobbe et al., 2021), SVAMP (Patel et al., 2021), and TabMWP (Lu et al., 2023); we utilize the official test split.
Dataset Splits | Yes | Due to budget constraints, we randomly sampled 500 examples from the validation set of each dataset and reported the results in terms of EM and F1 scores.
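As a concrete reading of this protocol, the sketch below computes multi-reference EM and token-level F1 with SQuAD-style answer normalization; the normalization details are a common convention rather than something the paper spells out, and the toy inputs are purely illustrative.

```python
import re
import string
from collections import Counter

def normalize(text):
    """Lowercase, drop punctuation and articles, collapse whitespace (SQuAD-style)."""
    text = "".join(ch for ch in text.lower() if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, references):
    """EM = 1 if the normalized prediction matches any normalized reference."""
    return float(any(normalize(prediction) == normalize(ref) for ref in references))

def token_f1(prediction, references):
    """Token-level F1, taking the maximum over the reference answers."""
    def f1(pred, ref):
        pred_tokens, ref_tokens = normalize(pred).split(), normalize(ref).split()
        overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
        if overlap == 0:
            return 0.0
        precision, recall = overlap / len(pred_tokens), overlap / len(ref_tokens)
        return 2 * precision * recall / (precision + recall)
    return max(f1(prediction, ref) for ref in references)

# Toy usage; the evaluated subset would be drawn once per dataset,
# e.g. random.sample(validation_set, 500), with scores averaged over it.
print(exact_match("The Eiffel Tower", ["Eiffel Tower", "the eiffel tower"]))  # 1.0
print(token_f1("Paris, France", ["Paris"]))                                   # ~0.67
```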
Hardware Specification | No | The paper does not explicitly specify the hardware (e.g., exact GPU/CPU models, memory amounts) used for running its experiments. It mentions the various LLMs used but not the underlying computational infrastructure.
Software Dependencies | No | The paper mentions using a Python interpreter and the PERSPECTIVE API but does not provide specific version numbers for these or for any other software libraries, frameworks, or dependencies used in the experiments.
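Although no versions are given, the toxicity feedback would presumably be obtained over the PERSPECTIVE API's public REST endpoint; the sketch below reflects that public interface as commonly documented, with the API key as a placeholder and the 10% stopping threshold taken from the setup described in the next row. Treat the request and response field names as assumptions to verify against the current API documentation.

```python
import requests

# Public Perspective API endpoint (request/response fields per the public docs; verify before use).
PERSPECTIVE_URL = "https://commentanalyzer.googleapis.com/v1alpha1/comments:analyze"

def toxicity_score(text, api_key):
    """Return the Perspective TOXICITY probability in [0, 1] for `text`."""
    payload = {
        "comment": {"text": text},
        "languages": ["en"],
        "requestedAttributes": {"TOXICITY": {}},
    }
    resp = requests.post(PERSPECTIVE_URL, params={"key": api_key}, json=payload, timeout=10)
    resp.raise_for_status()
    return resp.json()["attributeScores"]["TOXICITY"]["summaryScore"]["value"]

# Example stopping check mirroring the paper's setup: stop once toxicity < 10%.
# done = toxicity_score(candidate_text, PERSPECTIVE_API_KEY) < 0.10
```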
Experiment Setup | Yes | Free-form question answering: the maximum number of interactions is set to 7. We use CoT (Wei et al., 2022) to produce an initial answer and then correct up to n = 3 rounds, stopping early if the answer remains the same for two consecutive corrections. We use greedy decoding for all results. Mathematical program synthesis: we use the original error messages from the interpreter, such as NameError("num_pizza is not defined") or Timeout, and represent them in natural-language form as "Execution: {error message}". For execution results, we use the value of the variable answer after execution is completed. We use program-of-thought (PoT) (Chen et al., 2022) to generate the initial program and then apply a maximum of n = 4 corrections, stopping if the executed result remains unchanged for two consecutive revisions. We use greedy decoding for initial results, following previous work (Chen et al., 2022), and sampling with p = 0.5 for correction to avoid looping. Toxicity reduction: we set the maximum number of iterations n to 4 and terminate detoxification when the overall toxicity of an output falls below 10%. We use nucleus sampling with p = 0.9, the same as all the baselines (Welleck et al., 2023).
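To make the program-synthesis part of this setup concrete, here is a minimal sketch, assuming a hypothetical `llm` callable (and its keyword arguments) for the PoT generation, critique, and correction prompts; the executor mirrors the described feedback (the value of `answer` on success, "Execution: {error message}" on failure) but omits the sandboxing and timeout handling a real run would need.

```python
def execute_program(program):
    """Run a generated program and return tool feedback in the form described above:
    the value of `answer` on success, or 'Execution: {error message}' on failure."""
    scope = {}
    try:
        exec(program, scope)                     # no sandboxing or timeout in this sketch
        return str(scope.get("answer"))
    except Exception as err:                     # e.g., NameError: name 'num_pizza' is not defined
        return f"Execution: {err!r}"

def critic_program_synthesis(question, llm, max_corrections=4):
    """Generate a PoT program, then critique and correct it against interpreter feedback."""
    program = llm(question, mode="pot", greedy=True)                 # initial program, greedy decoding
    results = [execute_program(program)]
    for _ in range(max_corrections):
        critique = llm(question, program, results[-1], mode="critique", p=0.5)  # p = 0.5 sampling
        program = llm(question, program, critique, mode="correct", p=0.5)
        results.append(execute_program(program))
        if results[-1] == results[-2]:           # executed result unchanged for two consecutive revisions
            break
    return program, results[-1]
```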