IQA-EVAL: Automatic Evaluation of Human-Model Interactive Question Answering

Authors: Ruosen Li, Ruochen Li, Barry Wang, Xinya Du

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility variables, each listed with the assessed Result and the supporting LLM Response:
Research Type: Experimental
LLM Response: "We introduce an automatic evaluation framework IQA-EVAL to achieve Interactive Question Answering Evaluations... We show that: (1) our evaluation framework with GPT-4 (or Claude) as the backbone model achieves a high correlation with human evaluations on the IQA task; (2) assigning personas to LEA to better represent the crowd further significantly improves correlations."
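To make the persona mechanism concrete, the sketch below shows one way a persona could be folded into the LEA's role prompt before calling the GPT-4 backbone. It is a hypothetical illustration, not the paper's prompt or released code: the persona wording, prompt text, and function name are assumptions, and it presumes the `openai` Python SDK with an API key in the environment.

```python
# Hypothetical sketch of a persona-conditioned LEA turn (not the paper's exact prompt or code).
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def lea_reply(persona: str, task_description: str, dialogue: list[dict]) -> str:
    """Ask the LEA backbone (here GPT-4) for its next utterance, conditioned on a persona."""
    role_prompt = (
        f"You are a person seeking answers to questions. Persona: {persona}\n"
        f"{task_description}\n"
        "Interact with the assistant to resolve the question, then stop."
    )
    response = client.chat.completions.create(
        model="gpt-4-1106-preview",  # backbone version reported in the paper
        messages=[{"role": "system", "content": role_prompt}, *dialogue],
        temperature=0,
    )
    return response.choices[0].message.content
```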
Researcher Affiliation: Collaboration
LLM Response: "Ruosen Li1, Ruochen Li1, Barry Wang2, and Xinya Du1. 1Department of Computer Science, University of Texas at Dallas; 2Department of Computer Science, Carnegie Mellon University"
Pseudocode: No
LLM Response: "The paper describes the interaction generation and evaluation processes and provides prompt templates in Appendices C.1, C.2, and C.3, but these are not structured as formal pseudocode or algorithm blocks."
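Since no formal pseudocode is given, the outline below sketches the two described processes (interaction generation, then evaluation) under stated assumptions. The function names and the metric list are hypothetical placeholders standing in for the prompt templates of Appendices C.1, C.2, and C.3; they are not identifiers from the released repository.

```python
# Hypothetical outline of IQA-EVAL's two stages; placeholders stand in for the Appendix C prompts.
def run_iqa_eval(question, lea_turn, iqa_model_answer, lea_score, max_turns=5):
    # Stage 1: interaction generation -- the LEA converses with the IQA model.
    dialogue = []
    for _ in range(max_turns):
        query = lea_turn(question, dialogue)  # LEA asks or follows up (Appendix C.1-style prompt)
        dialogue.append({"role": "user", "content": query})
        if query.strip().lower().startswith("final answer"):  # assumed stopping convention
            break
        answer = iqa_model_answer(query, dialogue)  # the evaluated IQA model responds
        dialogue.append({"role": "assistant", "content": answer})

    # Stage 2: evaluation -- the LEA rates the whole interaction, one metric at a time.
    metrics = ["fluency", "helpfulness", "accuracy"]  # illustrative metric names only
    return {m: lea_score(question, dialogue, m) for m in metrics}
```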
Open Source Code: Yes
LLM Response: "https://github.com/du-nlp-lab/IQA-Eval"
Open Datasets: Yes
LLM Response: "We apply our evaluation method on the annotated dataset from the study by Lee et al. [2023]. This dataset consists of 3641 interactions from 331 annotators. Questions in the dataset are multi-choice and are derived from the MMLU dataset [Hendrycks et al., 2020]... AmbigQA (Min et al. [2020])... HotpotQA (Yang et al. [2018])... Natural Questions (Kwiatkowski et al. [2019b])"
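For readers who want to pull the public benchmarks named above, the snippet below is a minimal sketch using the Hugging Face `datasets` library. The hub identifiers, configuration names, and split choices are assumptions about commonly used mirrors, not paths taken from the paper or its repository.

```python
# Hypothetical loaders for the cited benchmarks (hub IDs and configs are assumptions).
from datasets import load_dataset

mmlu = load_dataset("cais/mmlu", "all", split="test")              # multi-choice questions
hotpot = load_dataset("hotpot_qa", "distractor", split="validation")
ambig = load_dataset("ambig_qa", "light", split="validation")
nq = load_dataset("natural_questions", split="validation")          # note: very large download
```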
Dataset Splits: No
LLM Response: "The paper mentions using various datasets for evaluation and benchmarking, but it does not explicitly specify the training, validation, and test splits (e.g., percentages or sample counts) for its experiments."
Hardware Specification: No
LLM Response: "The paper discusses using various LLMs (e.g., GPT-3.5, GPT-4, Claude, Llama2, Zephyr) for experiments, but it does not specify the underlying hardware (e.g., GPU models, CPU types, or cloud instance details) used to run these experiments."
Software Dependencies: No
LLM Response: "The paper mentions specific versions of the LLMs used (e.g., GPT-3.5-turbo-1106, GPT-4-1106-preview, Claude-1, Llama-2-7B, Zephyr-alpha) but does not provide a list of ancillary software dependencies, such as programming languages or libraries, with specific version numbers."
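One lightweight way to compensate for the missing dependency list is to pin the quoted model version strings in a configuration, as in the hypothetical mapping below; the Hugging Face IDs for the open-weight models are assumptions, since the paper gives only informal names.

```python
# Hypothetical pinning of the model versions quoted above (open-model hub IDs are assumptions).
MODEL_IDS = {
    "gpt-3.5": "gpt-3.5-turbo-1106",            # OpenAI API model string
    "gpt-4": "gpt-4-1106-preview",              # OpenAI API model string
    "claude": "claude-1",                       # Anthropic API model string
    "llama2": "meta-llama/Llama-2-7b-chat-hf",  # assumed Hugging Face ID
    "zephyr": "HuggingFaceH4/zephyr-7b-alpha",  # assumed Hugging Face ID
}
```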
Experiment Setup: Yes
LLM Response: "The structured prompt includes three key components: (1) a role description; (2) a task description; and (3) instructions for the discussion... The prompt contains three parts: (1) role and task description; (2) metrics definition; and (3) evaluation instruction. Finally, all evaluation scores for metrics are calculated by averaging the results of multiple runs."
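The quoted setup can be sketched as code: a three-part evaluation prompt assembled from a role-and-task description, metric definitions, and an evaluation instruction, with per-metric scores averaged over multiple runs. The helper names, prompt wording, and example numbers below are assumptions; only the three-part structure and the averaging step come from the quoted setup.

```python
# Hypothetical assembly of the three-part evaluation prompt plus averaging over runs.
from statistics import mean


def build_eval_prompt(role_and_task: str, metric_definitions: str,
                      instructions: str, dialogue_text: str) -> str:
    # (1) role and task description; (2) metrics definition; (3) evaluation instruction.
    return f"{role_and_task}\n\n{metric_definitions}\n\n{instructions}\n\nInteraction:\n{dialogue_text}"


def average_scores(score_runs: list[dict]) -> dict:
    """Average per-metric scores over multiple evaluation runs, as the setup describes."""
    metrics = score_runs[0].keys()
    return {m: mean(run[m] for run in score_runs) for m in metrics}


# Example: three runs of the same evaluation, averaged per metric.
runs = [{"fluency": 4, "helpfulness": 5},
        {"fluency": 5, "helpfulness": 4},
        {"fluency": 4, "helpfulness": 4}]
print(average_scores(runs))  # e.g. {'fluency': 4.33..., 'helpfulness': 4.33...}
```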