Evaluating Large Language Models at Evaluating Instruction Following

Authors: Zhiyuan Zeng, Jiatong Yu, Tianyu Gao, Yu Meng, Tanya Goyal, Danqi Chen

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | This paper investigates the efficacy of these LLM evaluators, particularly in using them to assess instruction following, a metric that gauges how closely generated text adheres to the given instruction. We introduce a challenging meta-evaluation benchmark, LLMBAR, designed to test the ability of an LLM evaluator in discerning instruction-following outputs. The authors manually curated 419 pairs of outputs, one adhering to instructions while the other diverging, yet may possess deceptive qualities that mislead an LLM evaluator, e.g., a more engaging tone. Contrary to existing meta-evaluation, we discover that different evaluators (i.e., combinations of LLMs and prompts) exhibit distinct performance on LLMBAR and even the highest-scoring ones have substantial room for improvement. We also present a novel suite of prompting strategies that further close the gap between LLM and human evaluators. With LLMBAR, we hope to offer more insight into LLM evaluators and foster future research in developing better instruction-following models. [...] We evaluate different evaluators (combinations of LLMs and prompting strategies) on LLMBAR. For each output pair, we query the evaluator twice with swapped orders. We then report average accuracy (Acc.) and positional agreement rate (Agr.).
Researcher Affiliation | Academia | (1) Department of Computer Science and Technology, Tsinghua University; (2) Princeton Language and Intelligence (PLI), Princeton University; (3) Department of Computer Science, University of Illinois Urbana-Champaign
Pseudocode | No | The paper does not contain any pseudocode or algorithm blocks.
Open Source Code | Yes | Our data and code are available at https://github.com/princeton-nlp/LLMBar.
Open Datasets | Yes | LLMBAR consists of two parts: (1) The NATURAL set collects instances from existing human-preference datasets. We further filter and modify them to ensure that an objective preference exists for each instance. (2) In the ADVERSARIAL set, the authors create the dispreferred output such that it deviates from the instruction but often has good superficial qualities and may thus distract the evaluator. [...] We first randomly sample a set of instructions and corresponding output pairs (I, O1, O2) from AlpacaFarm (Dubois et al., 2023) [2] and LLMEval2 (Zhang et al., 2023) [3]. [...] [2] The instructions I in AlpacaFarm were constructed using self-instruct (Wang et al., 2023d), while O1 and O2 are generated by instruction-tuned LLaMA-7B (Touvron et al., 2023a). [3] LLMEval2 is constructed by aggregating data from 15 existing preference datasets, containing a mix of human-written and model-generated instructions and outputs.
Dataset Splits | No | The paper does not explicitly mention training/validation/test splits for the LLMBAR dataset itself, as it functions as a meta-evaluation benchmark; the paper uses LLMBAR as a whole for evaluating LLM evaluators.
Hardware Specification | No | The paper mentions using "proprietary and open-source LLMs as base models" and notes "The API usage may incur high costs and delays," but it does not specify any hardware details (e.g., GPU models, CPU types) used for running the experiments.
Software Dependencies | No | The paper does not provide specific version numbers for software dependencies used in the experiments (e.g., Python, PyTorch, or specific libraries).
Experiment Setup | Yes | To enhance reproducibility, we set the temperature to 0 for proprietary models, and utilize greedy decoding for open-source models. [...] For each output pair, we query the evaluator twice with swapped orders. We then report average accuracy (Acc.) and positional agreement rate (Agr.).
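
The Open Datasets row above describes each LLMBAR instance as an instruction with two candidate outputs and an objective preference. The Python sketch below shows one way such instances could be loaded from a local clone of the released repository; the directory layout and the JSON field names (input, output_1, output_2, label) are assumptions for illustration and should be checked against https://github.com/princeton-nlp/LLMBar.

```python
# Hedged sketch: load one LLMBar subset from a local clone of the repository.
# The path layout and field names ("input", "output_1", "output_2", "label")
# are assumptions for illustration; verify them against the released repo.
import json
from pathlib import Path
from typing import Dict, List

def load_llmbar_subset(repo_root: str, subset: str = "Natural") -> List[Dict]:
    """Return a list of instances, each pairing an instruction ("input") with
    two candidate outputs and a gold preference label (1 or 2)."""
    path = Path(repo_root) / "Dataset" / "LLMBar" / subset / "dataset.json"
    with open(path, encoding="utf-8") as f:
        return json.load(f)

# Example usage (hypothetical paths):
# natural = load_llmbar_subset("LLMBar", "Natural")
# adversarial = load_llmbar_subset("LLMBar", "Adversarial/Neighbor")
```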
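
The Experiment Setup row quotes two decoding choices: temperature 0 for proprietary (API-based) models and greedy decoding for open-source models. The sketch below shows one way to apply both settings; the model names, prompt handling, and token budget are placeholder assumptions, not the paper's exact configuration.

```python
# Hedged sketch of the quoted decoding settings: temperature 0 for an API-based
# evaluator and greedy decoding for an open-source one. Model names, prompts,
# and the token budget below are placeholders, not the paper's exact setup.
from openai import OpenAI
from transformers import AutoModelForCausalLM, AutoTokenizer

def query_proprietary(prompt: str, model: str = "gpt-4") -> str:
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic decoding for reproducibility
    )
    return response.choices[0].message.content

def query_open_source(prompt: str, model_name: str = "meta-llama/Llama-2-7b-chat-hf") -> str:
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, do_sample=False, max_new_tokens=64)  # greedy decoding
    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```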
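
Both the Research Type and Experiment Setup rows quote the same meta-evaluation protocol: each evaluator is queried twice per output pair with the order of the two outputs swapped, and average accuracy (Acc.) and positional agreement rate (Agr.) are reported. The sketch below is one straightforward reading of that protocol, assuming a hypothetical evaluate function that returns 1 or 2 for the preferred output; the paper's exact scoring may differ in detail.

```python
# Hedged sketch of the swapped-order meta-evaluation quoted above. `evaluate`
# is a hypothetical evaluator: given an instruction and two candidate outputs,
# it returns 1 or 2 for the preferred one. Acc. averages correctness over both
# orderings; Agr. is the rate at which the two orderings agree with each other.
from typing import Callable, Dict, List

def meta_evaluate(
    instances: List[Dict],                     # each: {"input", "output_1", "output_2", "label"}
    evaluate: Callable[[str, str, str], int],  # (instruction, output_a, output_b) -> 1 or 2
) -> Dict[str, float]:
    correct, agree = 0.0, 0.0
    for ex in instances:
        # Query once in the original order and once with the outputs swapped.
        pref_original = evaluate(ex["input"], ex["output_1"], ex["output_2"])
        pref_swapped = evaluate(ex["input"], ex["output_2"], ex["output_1"])
        pref_swapped = 1 if pref_swapped == 2 else 2  # map back to original indexing
        # Average accuracy over the two orderings against the gold label.
        correct += ((pref_original == ex["label"]) + (pref_swapped == ex["label"])) / 2
        # Positional agreement: both orderings prefer the same output.
        agree += float(pref_original == pref_swapped)
    n = len(instances)
    return {"Acc.": correct / n, "Agr.": agree / n}
```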