Agent Instructs Large Language Models to be General Zero-Shot Reasoners

Authors: Nicholas Crispino, Kyle Montgomery, Fankun Zeng, Dawn Song, Chenguang Wang

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We empirically evaluate the zero-shot reasoning abilities of LLMs on a wide set of language understanding tasks across 29 datasets (including 53 subsets), spanning generation, classification, and reasoning. Zero-shot AgentInstruct obtains state-of-the-art performance on 20 datasets. We conduct our evaluation on three state-of-the-art LLMs, namely, Vicuna (Chiang et al., 2023), Llama-2-chat (Touvron et al., 2023b), and GPT-3.5 Turbo (OpenAI, 2022).
Researcher Affiliation | Academia | Washington University in St. Louis, MO, USA; UC Berkeley, CA, USA.
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks; the method is described in narrative text and illustrated with figures.
Open Source Code | Yes | The code is available at https://github.com/wang-research-lab/agentinstruct.
Open Datasets | Yes | We evaluate zero-shot AgentInstruct on a wide selection of datasets consisting of all HELM core scenarios (Liang et al., 2023) and all the reasoning datasets used to benchmark zero-shot CoT (Kojima et al., 2022). The datasets are listed in Table 4, along with their categorizations.
Dataset Splits | No | The paper specifies test instance counts for several datasets (e.g., 'The dataset contains 395 test instances.' for AddSub; 'We sample 1,000 test instances following HELM.' for BoolQ) and notes that IMDB has 25,000 training and 25,000 test instances. However, it does not consistently describe validation splits, often referring only to 'test instances' or 'test sets' without explaining how (or whether) a validation set was created or used.
Hardware Specification | Yes | Inference requests on Vicuna and Llama-2-70b-chat were submitted from HELM to a TorchServe API running on a local cluster containing 2 nodes, each with 8x NVIDIA RTX A6000 GPUs. (A request sketch follows the table.)
Software Dependencies | Yes | We use GPT-4 (OpenAI, 2023) inside our agent with a default temperature of 0.3 and the snapshot gpt-4-0613 when generating instructions. Using the GPT-3.5 tokenizer (cl100k_base).
Experiment Setup | Yes | We use GPT-4 (OpenAI, 2023) inside our agent with a default temperature of 0.3 and the snapshot gpt-4-0613 when generating instructions. ... All inference was done using a temperature of 0.0, except on datasets that involved summarization, where a temperature of 0.3 was used. For the reasoning extraction prompt, we request a maximum of 512 new tokens, and for the answer extraction prompt, the number of tokens requested is specific to each dataset. (A parameter sketch follows the table.)
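
For context on the Hardware Specification row: TorchServe exposes a REST inference endpoint (by default, POST /predictions/<model_name> on port 8080). Below is a minimal sketch of how a client such as HELM might submit a generation request to such a server. The host, model name, and payload fields are illustrative assumptions, not details taken from the paper; the actual payload shape depends on the custom TorchServe handler used in the deployment.

```python
import requests

# Hypothetical TorchServe endpoint; the host and model name are assumptions.
# TorchServe's default inference API listens on port 8080 at /predictions/<model_name>.
TORCHSERVE_URL = "http://localhost:8080/predictions/llama-2-70b-chat"

def query_model(prompt: str, max_new_tokens: int = 512, temperature: float = 0.0) -> str:
    """Send a generation request and return the raw text response.

    The payload keys here are assumptions: the real fields are defined by
    the custom handler registered with the TorchServe deployment.
    """
    payload = {
        "prompt": prompt,
        "max_new_tokens": max_new_tokens,
        "temperature": temperature,
    }
    response = requests.post(TORCHSERVE_URL, json=payload, timeout=300)
    response.raise_for_status()
    return response.text

if __name__ == "__main__":
    print(query_model("Q: What is 17 + 25?\nA: Let's think step by step."))
```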
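
Similarly, for the Software Dependencies and Experiment Setup rows, the sketch below collects the stated generation parameters (gpt-4-0613 at temperature 0.3 for instruction generation; temperature 0.0 for inference, 0.3 for summarization; 512 new tokens for reasoning extraction) using the OpenAI Python client and tiktoken. The helper names and the per-dataset answer-token table are hypothetical; the paper says only that answer-extraction token limits are dataset-specific.

```python
import tiktoken
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# The paper counts tokens with the GPT-3.5 tokenizer (cl100k_base).
encoding = tiktoken.get_encoding("cl100k_base")

# Hypothetical per-dataset budgets for the answer extraction prompt;
# the paper states only that this value is dataset-specific.
ANSWER_TOKENS = {"boolq": 5, "addsub": 16}

def generate_instructions(task_description: str) -> str:
    """Agent-side instruction generation: gpt-4-0613 at temperature 0.3."""
    resp = client.chat.completions.create(
        model="gpt-4-0613",
        temperature=0.3,
        messages=[{"role": "user", "content": task_description}],
    )
    return resp.choices[0].message.content

def inference_params(dataset: str, is_summarization: bool, stage: str) -> dict:
    """Per-paper settings: temperature 0.0 (0.3 for summarization datasets);
    512 new tokens for reasoning extraction, dataset-specific otherwise."""
    return {
        "temperature": 0.3 if is_summarization else 0.0,
        "max_tokens": 512 if stage == "reasoning" else ANSWER_TOKENS[dataset],
    }

# Token-count example with the cl100k_base encoding:
print(len(encoding.encode("Let's think step by step.")))
```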