Evoke: Evoking Critical Thinking Abilities in LLMs via Reviewer-Author Prompt Editing
Authors: Xinyu Hu, Pengfei Tang, Simiao Zuo, Zihan Wang, Bowen Song, Qiang Lou, Jian Jiao, Denis X Charles
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results show that Evoke significantly outperforms existing methods. |
| Researcher Affiliation | Collaboration | Microsoft, University of Washington, University of Michigan |
| Pseudocode | Yes | Algorithm 1: Evoke |
| Open Source Code | No | The paper does not provide an explicit statement about releasing its source code or a link to a code repository for the described methodology. |
| Open Datasets | Yes | We perform a comprehensive evaluation on eight tasks from Instruction Induction (Honovich et al., 2022) and Big Bench Instruction Induction (BBII) (Zhou et al., 2022), including orthography starts with, common concept, rhymes, movie recommendation, logical fallacy detection, presuppositions as nli, winowhy, epistemic reasoning. |
| Dataset Splits | No | The paper states: 'For each task, we divide the dataset randomly into two sets, 60% of the data is allocated for training (prompt refinement) and the remaining 40% is for testing (prompt evaluation).' It does not explicitly mention a validation set or split. |
| Hardware Specification | No | The paper states: 'In all experiments, we utilize the Azure Open AI API service (GPT-4) for the involved LLMs.' It does not specify the underlying hardware specifications (e.g., specific GPU models, CPUs) used for running the experiments beyond this API usage. |
| Software Dependencies | No | The paper mentions using 'Azure Open AI API service (GPT-4)' but does not provide specific version numbers for software dependencies such as programming languages, libraries, or frameworks. |
| Experiment Setup | Yes | The paper describes the workflow of Evoke, including the roles of LLM-Author, LLM-Reviewer, and LLM-Selector, their prompts, and the iterative refinement process. For example, 'The workflow comprises three steps: First, the LLM-Author edits prompts from previous iterations, taking into account the past edits and the feedback from the LLM-Reviewer. Second, the LLM-Reviewer scores the revised prompts from the LLM-Author, and the top-n candidates with the highest scores are selected for subsequent procedures. ... Details of the algorithm can be found in Algorithm 1.' A minimal sketch of this loop appears after the table. |
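
For orientation, below is a minimal Python sketch of the reviewer-author loop and the 60/40 data split described above. It is not the authors' released code (none is provided): the function names (`call_llm`, `evaluate_prompt`, `parse_score`), the prompt templates, and the default hyperparameters are illustrative assumptions, and the LLM-Selector step elided in the quoted excerpt is omitted here.

```python
import random
import re


def call_llm(instruction: str) -> str:
    """Placeholder for the GPT-4 call (the paper uses the Azure OpenAI API service)."""
    raise NotImplementedError


def evaluate_prompt(prompt: str, examples) -> float:
    """Placeholder: task accuracy of `prompt` on a set of examples."""
    raise NotImplementedError


def split_dataset(examples, train_frac=0.6, seed=0):
    """Random 60/40 split into refinement (train) and evaluation (test) sets."""
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]


def parse_score(review: str) -> float:
    """Pull the first number out of the reviewer's critique (a simplifying assumption)."""
    match = re.search(r"\d+(\.\d+)?", review)
    return float(match.group()) if match else 0.0


def evoke(initial_prompt: str, train_set, num_iters: int = 5, top_n: int = 2):
    """Iterative reviewer-author prompt editing, loosely following Algorithm 1."""
    candidates = [initial_prompt]
    feedback = ""
    for _ in range(num_iters):
        # Step 1: LLM-Author edits the previous prompts, taking the reviewer's
        # feedback into account.
        edited = [
            call_llm(
                "Rewrite the task prompt below to address the reviewer feedback.\n"
                f"Prompt: {prompt}\nFeedback: {feedback}"
            )
            for prompt in candidates
        ]
        # Step 2: LLM-Reviewer critiques and scores each revised prompt; the
        # top-n candidates with the highest scores are kept for the next round.
        reviews = [
            (call_llm(f"Score this task prompt from 0 to 10 and explain why:\n{prompt}"), prompt)
            for prompt in edited
        ]
        reviews.sort(key=lambda item: parse_score(item[0]), reverse=True)
        feedback = reviews[0][0]
        candidates = [prompt for _, prompt in reviews[:top_n]]
    # Return the candidate that performs best on the refinement split.
    return max(candidates, key=lambda prompt: evaluate_prompt(prompt, train_set))
```

Under these assumptions, a run would look like `train_set, test_set = split_dataset(examples)` followed by `best_prompt = evoke("Classify whether the hypothesis follows.", train_set)`, with final evaluation of `best_prompt` on `test_set`.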