Towards Credible Human Evaluation of Open-Domain Dialog Systems Using Interactive Setup
Authors: Sijia Liu, Patrick Lange, Behnam Hedayatnia, Alexandros Papangelis, Di Jin, Andrew Wirth, Yang Liu, Dilek Hakkani-Tur
AAAI 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our comprehensive human evaluation results shed light on how to conduct credible human evaluations of open-domain dialog systems using the interactive setup |
| Researcher Affiliation | Industry | Amazon Alexa AI {sijial, patlange, behnam, papangea, djinamzn, wirandre, yangliud, hakkanit}@amazon.com |
| Pseudocode | No | The paper does not contain any sections explicitly labeled 'Pseudocode' or 'Algorithm', nor are there any structured code-like blocks describing procedures. |
| Open Source Code | No | The paper mentions leveraging 'the Hugging Face's transformers library for all our models' and provides a link to that library's GitHub repository, but it does not state that the authors' own code for the presented methodology is open-source or publicly available. |
| Open Datasets | Yes | GPT2-XL/GPT2-M fine-tuned on the Blended Skill Talk (BST) dataset (Smith et al. 2020); GPT2-XL fine-tuned on the Topical Chat (TCS) dataset (Gopalakrishnan et al. 2019); GPT2-XL fine-tuned on the Wizard-of-Wikipedia (WoW) dataset (Dinan et al. 2018). A minimal data-loading sketch follows the table. |
| Dataset Splits | No | The paper discusses the datasets that GPT2 models were fine-tuned on and statistical power for evaluations, but it does not provide explicit details about train/validation/test dataset splits used for the interactive evaluation data collected in their experiments. |
| Hardware Specification | No | The paper does not provide any specific hardware details such as GPU models, CPU types, or cloud computing instance specifications used for running their experiments or fine-tuning the models. |
| Software Dependencies | No | The paper mentions using 'the Hugging Face's transformers library for all our models' but does not specify a version number for this library or for any other software dependencies, such as the Python version or the deep learning framework used. |
| Experiment Setup | No | The paper describes the setup of the human interactive evaluation mechanisms (e.g., SOBA, SATA) and how model responses were decoded (e.g., with 'nucleus sampling'), but it does not provide concrete hyperparameter values such as learning rates, batch sizes, optimizer settings, or detailed training schedules that would constitute a reproducible experimental setup. An illustrative nucleus-sampling sketch follows the table. |
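
The Open Datasets row lists the public corpora the GPT2 models were fine-tuned on. As a minimal sketch, assuming the Hugging Face Hub dataset id `blended_skill_talk` for the BST corpus (the paper does not describe its data-loading pipeline), one of these datasets can be pulled as follows:

```python
# Minimal sketch: loading Blended Skill Talk with the Hugging Face
# `datasets` library. The dataset id "blended_skill_talk" refers to the
# public Hub copy and is an assumption; the paper does not specify how
# its fine-tuning data was obtained or preprocessed.
from datasets import load_dataset

bst = load_dataset("blended_skill_talk", split="train")
print(len(bst))       # number of training dialogs
print(bst[0].keys())  # dialog fields available for fine-tuning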
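```

Similarly, the Experiment Setup row notes that responses were generated with nucleus sampling but reports no decoding hyperparameters. The sketch below shows nucleus (top-p) sampling with a GPT2-XL checkpoint via the transformers library; the prompt, `top_p` value, and generation length are illustrative assumptions, not the paper's settings.

```python
# Illustrative sketch of nucleus (top-p) sampling with GPT2-XL using the
# Hugging Face transformers library. The decoding hyperparameters below
# are assumed for demonstration; the paper does not report its values.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2-xl")
model = AutoModelForCausalLM.from_pretrained("gpt2-xl")

prompt = "Hi there! What did you get up to this weekend?"
inputs = tokenizer(prompt, return_tensors="pt")

# do_sample=True with top_p restricts sampling to the smallest set of
# tokens whose cumulative probability exceeds top_p (nucleus sampling).
outputs = model.generate(
    **inputs,
    do_sample=True,
    top_p=0.9,                 # assumed value
    max_new_tokens=40,         # assumed value
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```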