IQA-EVAL: Automatic Evaluation of Human-Model Interactive Question Answering
Authors: Ruosen Li, Ruochen Li, Barry Wang, Xinya Du
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We introduce an automatic evaluation framework IQA-EVAL to achieve Interactive Question Answering Evaluations... We show that: (1) our evaluation framework with GPT-4 (or Claude) as the backbone model achieves a high correlation with human evaluations on the IQA task; (2) assigning personas to LEA to better represent the crowd further significantly improves correlations. |
| Researcher Affiliation | Collaboration | Ruosen Li¹, Ruochen Li¹, Barry Wang², and Xinya Du¹. ¹Department of Computer Science, University of Texas at Dallas; ²Department of Computer Science, Carnegie Mellon University |
| Pseudocode | No | The paper describes the interaction generation and evaluation processes and provides prompt templates in Appendix C.1, C.2, and C.3, but these are not structured as formal pseudocode or algorithm blocks. |
| Open Source Code | Yes | https://github.com/du-nlp-lab/IQA-Eval |
| Open Datasets | Yes | We apply our evaluation method on the annotated dataset from the study by Lee et al. [2023]. This dataset consists of 3641 interactions from 331 annotators. Questions in the dataset are multiple-choice and are derived from the MMLU dataset [Hendrycks et al., 2020]... AmbigQA (Min et al. [2020])... HotpotQA (Yang et al. [2018])... Natural Questions (Kwiatkowski et al. [2019b]) |
| Dataset Splits | No | The paper mentions using various datasets for evaluation and benchmarking, but it does not explicitly specify the training, validation, and test splits (e.g., percentages or sample counts) for its experiments. |
| Hardware Specification | No | The paper discusses using various LLM models (e.g., GPT-3.5, GPT-4, Claude, Llama2, Zephyr) for experiments, but it does not specify the underlying hardware (e.g., GPU models, CPU types, or cloud instance details) used to run these experiments. |
| Software Dependencies | No | The paper mentions specific versions of LLM models used (e.g., GPT-3.5-turbo-1106, GPT-4-1106-preview, Claude-1, Llama-2-7B, Zephyr-alpha) but does not provide a list of ancillary software dependencies, such as programming languages or libraries, with specific version numbers. |
| Experiment Setup | Yes | The structured prompt includes three key components: (1) a role description; (2) a task description; and (3) instructions for the discussion... The prompt contains three parts: (1) role and task description; (2) metrics definition; and (3) evaluation instruction. Finally, all evaluation scores for metrics are calculated by averaging the results of multiple runs. |
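
The Experiment Setup row describes two structured prompts: one for the LLM-based evaluation agent (LEA) that simulates a human questioner, built from a role description, a task description, and discussion instructions (optionally with an assigned persona, as in the paper's persona experiments), and one for scoring a completed interaction, built from a role and task description, metric definitions, and an evaluation instruction, with final metric scores averaged over multiple runs. The Python sketch below illustrates this prompt assembly and averaging; all function names, metric definitions, and prompt wording are illustrative assumptions rather than the authors' implementation, which is available in the linked GitHub repository.

```python
"""Illustrative sketch of the prompt structure and score averaging described
in the Experiment Setup row. Names and wording are hypothetical."""
from statistics import mean
from typing import Callable, Dict, Optional

# --- Prompt for the LLM-based evaluation agent (LEA) simulating a human user ---
def build_lea_prompt(question: str, persona: Optional[str] = None) -> str:
    role = "You are a human user interacting with an AI assistant to answer a question."
    if persona:  # persona assignment to the LEA, as in the paper's persona experiments
        role += f" Act as the following persona: {persona}."
    task = f"Your goal is to answer the question correctly: {question}"
    instructions = (
        "Ask the assistant follow-up questions when its answers are unclear, "
        "and stop once you are confident in the final answer."
    )
    return "\n\n".join([role, task, instructions])

# --- Hypothetical metric definitions used in the evaluation prompt ---
METRICS = {
    "fluency": "How natural and coherent are the assistant's responses?",
    "helpfulness": "How much do the responses help the user reach the answer?",
}

# --- Prompt for evaluating a completed interaction on the defined metrics ---
def build_eval_prompt(interaction: str) -> str:
    role_and_task = "You are an evaluator. Rate the AI assistant in the interaction below."
    metric_defs = "\n".join(f"- {name}: {desc}" for name, desc in METRICS.items())
    instruction = (
        "For each metric, output an integer score from 1 to 5 "
        "in the format 'metric: score'.\n\nInteraction:\n" + interaction
    )
    return "\n\n".join([role_and_task, "Metrics:\n" + metric_defs, instruction])

# --- Final scores are averaged over multiple evaluation runs ---
def average_scores(interaction: str,
                   call_llm: Callable[[str], Dict[str, float]],
                   n_runs: int = 3) -> Dict[str, float]:
    # call_llm is a placeholder for any backbone model (e.g., GPT-4 or Claude)
    # that returns parsed per-metric scores for a given evaluation prompt.
    runs = [call_llm(build_eval_prompt(interaction)) for _ in range(n_runs)]
    return {m: mean(run[m] for run in runs) for m in METRICS}
```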