Agent Instructs Large Language Models to be General Zero-Shot Reasoners

Authors: Nicholas Crispino, Kyle Montgomery, Fankun Zeng, Dawn Song, Chenguang Wang

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We empirically evaluate the zero-shot reasoning abilities of LLMs on a wide set of language understanding tasks across 29 datasets (including 53 subsets), spanning generation, classification, and reasoning. Zero-shot AgentInstruct obtains state-of-the-art performance on 20 datasets. We conduct our evaluation on three state-of-the-art LLMs, namely, Vicuna (Chiang et al., 2023), Llama-2-chat (Touvron et al., 2023b), and GPT-3.5 Turbo (OpenAI, 2022).
Researcher Affiliation | Academia | Washington University in St. Louis, MO, USA; UC Berkeley, CA, USA.
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks; the method is described in narrative text and illustrated with figures.
Open Source Code | Yes | The code is available at https://github.com/wang-research-lab/agentinstruct.
Open Datasets | Yes | We evaluate zero-shot AgentInstruct on a wide selection of datasets consisting of all HELM core scenarios (Liang et al., 2023) and all the reasoning datasets used to benchmark zero-shot CoT (Kojima et al., 2022). The datasets are listed in Table 4, along with their categorizations.
Dataset Splits | No | The paper specifies test instance counts for several datasets (e.g., 'The dataset contains 395 test instances.' for AddSub; 'We sample 1,000 test instances following HELM.' for BoolQ) and notes that IMDB has 25,000 training and 25,000 test instances. However, it does not consistently describe validation splits, often referring only to 'test instances' or 'test sets' without explaining how (or whether) a validation set was created or used.
Hardware Specification | Yes | Inference requests on Vicuna and Llama-2-70b-chat were submitted from HELM to a TorchServe API running on a local cluster containing 2 nodes, each with 8x NVIDIA RTX A6000 GPUs. (A request sketch follows the table.)
Software Dependencies | Yes | We use GPT-4 (OpenAI, 2023) inside our agent with a default temperature of 0.3 and the snapshot gpt-4-0613 when generating instructions. Using the GPT-3.5 tokenizer (cl100k_base).
Experiment Setup | Yes | We use GPT-4 (OpenAI, 2023) inside our agent with a default temperature of 0.3 and the snapshot gpt-4-0613 when generating instructions. ... All inference was done using a temperature of 0.0, except on datasets that involved summarization, where a temperature of 0.3 was used. For the reasoning extraction prompt, we request a maximum of 512 new tokens, and for the answer extraction prompt, the number of tokens requested is specific to each dataset. (A parameter sketch follows the table.)
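
For context on the Hardware Specification row: TorchServe exposes a REST inference endpoint (by default, POST /predictions/<model_name> on port 8080). Below is a minimal sketch of how a client such as HELM might submit a generation request to such a server. The host, model name, and payload fields are illustrative assumptions, not details taken from the paper; the actual payload shape depends on the custom TorchServe handler used in the deployment.

```python
import requests

# Hypothetical TorchServe endpoint; the host and model name are assumptions.
# TorchServe's default inference API listens on port 8080 at /predictions/<model_name>.
TORCHSERVE_URL = "http://localhost:8080/predictions/llama-2-70b-chat"

def query_model(prompt: str, max_new_tokens: int = 512, temperature: float = 0.0) -> str:
    """Send a generation request and return the raw text response.

    The payload keys here are assumptions: the real fields are defined by
    the custom handler registered with the TorchServe deployment.
    """
    payload = {
        "prompt": prompt,
        "max_new_tokens": max_new_tokens,
        "temperature": temperature,
    }
    response = requests.post(TORCHSERVE_URL, json=payload, timeout=300)
    response.raise_for_status()
    return response.text

if __name__ == "__main__":
    print(query_model("Q: What is 17 + 25?\nA: Let's think step by step."))
```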
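
Similarly, for the Software Dependencies and Experiment Setup rows, the sketch below collects the stated generation parameters (gpt-4-0613 at temperature 0.3 for instruction generation; temperature 0.0 for inference, 0.3 for summarization; 512 new tokens for reasoning extraction) using the OpenAI Python client and tiktoken. The helper names and the per-dataset answer-token table are hypothetical; the paper says only that answer-extraction token limits are dataset-specific.

```python
import tiktoken
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# The paper counts tokens with the GPT-3.5 tokenizer (cl100k_base).
encoding = tiktoken.get_encoding("cl100k_base")

# Hypothetical per-dataset budgets for the answer extraction prompt;
# the paper states only that this value is dataset-specific.
ANSWER_TOKENS = {"boolq": 5, "addsub": 16}

def generate_instructions(task_description: str) -> str:
    """Agent-side instruction generation: gpt-4-0613 at temperature 0.3."""
    resp = client.chat.completions.create(
        model="gpt-4-0613",
        temperature=0.3,
        messages=[{"role": "user", "content": task_description}],
    )
    return resp.choices[0].message.content

def inference_params(dataset: str, is_summarization: bool, stage: str) -> dict:
    """Per-paper settings: temperature 0.0 (0.3 for summarization datasets);
    512 new tokens for reasoning extraction, dataset-specific otherwise."""
    return {
        "temperature": 0.3 if is_summarization else 0.0,
        "max_tokens": 512 if stage == "reasoning" else ANSWER_TOKENS[dataset],
    }

# Token-count example with the cl100k_base encoding:
print(len(encoding.encode("Let's think step by step.")))
```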