Approximating Interactive Human Evaluation with Self-Play for Open-Domain Dialog Systems

Authors: Asma Ghandeharioun, Judy Hanwen Shen, Natasha Jaques, Craig Ferguson, Noah Jones, Agata Lapedriza, Rosalind Picard

NeurIPS 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this paper, we investigate interactive human evaluation and provide evidence for its necessity; we then introduce a novel, model-agnostic, and dataset-agnostic method to approximate it. In particular, we propose a self-play scenario where the dialog system talks to itself and we calculate a combination of proxies such as sentiment and semantic coherence on the conversation trajectory. We show that this metric is capable of capturing the human-rated quality of a dialog model better than any automated metric known to-date, achieving a significant Pearson correlation (r > .7, p < .05). To investigate the strengths of this novel metric and interactive evaluation in comparison to state-of-the-art metrics and human evaluation of static conversations, we perform extended experiments with a set of models... (A sketch of this self-play scoring idea appears after the table.)
Researcher Affiliation | Academia | Department of Media Arts and Science, Massachusetts Institute of Technology, Cambridge, MA 02139; {asma_gh,judyshen,jaquesn}@mit.edu, {fergusoc,ncjones,agata}@mit.edu, picard@media.mit.edu
Pseudocode | No | The paper does not contain any clearly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | All the code, data, and interactive evaluation platform resulting from our work are publicly available. The code for all our models is available at https://github.com/natashamjaques/neural_chat and was originally based on [4].
Open Datasets | Yes | A common source of data for open-domain dialog systems is movie scripts, among which the CORNELL dataset [38] is the largest and most commonly used. This REDDIT dataset is available at https://affect.media.mit.edu/neural_chat/datasets.
Dataset Splits | No | The paper describes a 'leave-one-bot-out' scenario for validating its hybrid metric, but does not explicitly provide the training, validation, and test dataset splits for the models themselves (e.g., percentage splits or sample counts). (A sketch of the leave-one-bot-out validation appears after the table.)
Hardware Specification | No | The paper mentions 'providing computing resources' in the acknowledgments, but does not specify any particular hardware details such as GPU models, CPU types, or memory specifications used for the experiments.
Software Dependencies | No | The paper mentions using specific models like a sentiment detector [32] and Infersent [33], and states that the code is available, but does not list specific version numbers for software dependencies (e.g., Python, PyTorch/TensorFlow, or other libraries).
Experiment Setup | Yes | For details regarding hyper-parameter tuning refer to A.12. The hyperparameters were chosen based on empirical results. The values for our models are provided in Table 5.
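
The self-play evaluation quoted in the Research Type row boils down to: let the dialog model converse with itself, then score the resulting trajectory with automated proxies such as sentiment and semantic coherence. Below is a minimal illustrative sketch of that idea, not the authors' implementation; `respond`, `sentiment_score`, and `embed` are hypothetical placeholders for a dialog model, a sentiment detector, and a sentence encoder.

```python
# Minimal sketch of self-play scoring: the bot talks to itself for a fixed
# number of turns, and simple proxies are averaged over the trajectory.
# `respond`, `sentiment_score`, and `embed` are assumed/hypothetical hooks.
import numpy as np

def self_play_scores(respond, sentiment_score, embed, seed="hi", turns=10):
    history = [seed]
    for _ in range(turns):
        history.append(respond(history))  # bot replies to its own conversation so far

    # Proxy 1: average sentiment of the generated utterances.
    sentiment = float(np.mean([sentiment_score(u) for u in history[1:]]))

    # Proxy 2: semantic coherence as cosine similarity of consecutive utterances.
    vecs = [np.asarray(embed(u), dtype=float) for u in history]
    coherence = float(np.mean([
        v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2) + 1e-8)
        for v1, v2 in zip(vecs, vecs[1:])
    ]))
    return {"sentiment": sentiment, "coherence": coherence}
```

The paper combines such proxies into a single hybrid score; the second sketch below shows one way a combination of this kind could be validated against human ratings.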
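
The Dataset Splits row mentions a leave-one-bot-out validation of the hybrid metric. The sketch below is one plausible reading of that procedure, under the assumption that each bot has a vector of automated proxy scores and an interactive human rating: a linear combination of proxies is fit with one bot held out, the held-out bot is predicted, and the predictions are correlated with the human ratings.

```python
# Rough sketch of a leave-one-bot-out check of a hybrid metric (an assumed
# reading of the procedure, not the authors' code): fit proxy weights on all
# bots but one, predict the held-out bot, then correlate with human ratings.
import numpy as np
from scipy.stats import pearsonr

def leave_one_bot_out(proxy_matrix, human_ratings):
    """proxy_matrix: (n_bots, n_proxies) automated scores; human_ratings: (n_bots,)."""
    X = np.asarray(proxy_matrix, dtype=float)
    y = np.asarray(human_ratings, dtype=float)
    preds = []
    for i in range(len(y)):
        mask = np.arange(len(y)) != i
        w, *_ = np.linalg.lstsq(X[mask], y[mask], rcond=None)  # fit on remaining bots
        preds.append(float(X[i] @ w))                           # predict the held-out bot
    r, p = pearsonr(preds, y)  # Pearson correlation with interactive human ratings
    return r, p
```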