Approximating Interactive Human Evaluation with Self-Play for Open-Domain Dialog Systems
Authors: Asma Ghandeharioun, Judy Hanwen Shen, Natasha Jaques, Craig Ferguson, Noah Jones, Agata Lapedriza, Rosalind Picard
NeurIPS 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this paper, we investigate interactive human evaluation and provide evidence for its necessity; we then introduce a novel, model-agnostic, and dataset-agnostic method to approximate it. In particular, we propose a self-play scenario where the dialog system talks to itself and we calculate a combination of proxies such as sentiment and semantic coherence on the conversation trajectory. We show that this metric is capable of capturing the human-rated quality of a dialog model better than any automated metric known to-date, achieving a significant Pearson correlation (r > .7, p < .05). To investigate the strengths of this novel metric and interactive evaluation in comparison to state-of-the-art metrics and human evaluation of static conversations, we perform extended experiments with a set of models... (A minimal code sketch of this self-play scoring idea follows the table.) |
| Researcher Affiliation | Academia | Department of Media Arts and Science, Massachusetts Institute of Technology, Cambridge, MA 02139; {asma_gh,judyshen,jaquesn}@mit.edu, {fergusoc,ncjones,agata}@mit.edu, picard@media.mit.edu |
| Pseudocode | No | The paper does not contain any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | All the code, data, and interactive evaluation platform resulting from our work are publicly available. The code for all our models is available at https://github.com/natashamjaques/neural_chat and was originally based on [4]. |
| Open Datasets | Yes | A common source of data for open-domain dialog systems is movie scripts, among which the CORNELL dataset [38] is the largest and most commonly used. ... This REDDIT dataset is available at https://affect.media.mit.edu/neural_chat/datasets. |
| Dataset Splits | No | The paper describes a 'leave-one-bot-out' scenario for validating the proposed hybrid metric, but does not explicitly provide the training, validation, and test dataset splits for the models themselves (e.g., percentage splits or sample counts). |
| Hardware Specification | No | The paper mentions 'providing computing resources' in the acknowledgments, but does not specify any particular hardware details such as GPU models, CPU types, or memory specifications used for the experiments. |
| Software Dependencies | No | The paper mentions using specific models like a sentiment detector [32] and Infersent [33], and states the code is available, but does not list specific version numbers for software dependencies (e.g., Python, PyTorch/TensorFlow, or other libraries). |
| Experiment Setup | Yes | For details regarding hyper-parameter tuning refer to A.12. The hyperparameters were chosen based on empirical results. The values for our models are provided in Table 5. |
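
The "Research Type" row above summarizes the paper's core idea: let a dialog model converse with itself, compute proxy metrics (e.g., sentiment and semantic coherence) over the resulting trajectory, combine them into a hybrid score, and check how well that score tracks interactive human ratings. The sketch below is a minimal, hypothetical illustration of that idea under stated assumptions, not the authors' released implementation (https://github.com/natashamjaques/neural_chat): `bot`, `sentiment_fn`, and `embed_fn` are assumed stand-ins for a trained dialog model, the paper's sentiment detector [32], and Infersent [33] sentence embeddings, and the metric weights are illustrative only.

```python
from typing import Callable, List, Sequence

import numpy as np
from scipy.stats import pearsonr


def self_play(bot: Callable[[List[str]], str], seed: str, turns: int = 10) -> List[str]:
    """Let the dialog model talk to itself, starting from a seed utterance."""
    history = [seed]
    for _ in range(turns):
        history.append(bot(history))
    return history


def semantic_coherence(utterances: Sequence[str],
                       embed_fn: Callable[[str], np.ndarray]) -> float:
    """Mean cosine similarity between consecutive utterance embeddings
    (a stand-in for an Infersent-style coherence proxy)."""
    embs = [embed_fn(u) for u in utterances]
    sims = [float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))
            for a, b in zip(embs, embs[1:])]
    return float(np.mean(sims))


def hybrid_score(utterances: Sequence[str],
                 sentiment_fn: Callable[[str], float],
                 embed_fn: Callable[[str], np.ndarray],
                 weights: Sequence[float] = (0.5, 0.5)) -> float:
    """Combine proxy metrics over one self-play trajectory.
    The 0.5/0.5 weights are illustrative, not the paper's fitted values."""
    sentiment = float(np.mean([sentiment_fn(u) for u in utterances]))
    coherence = semantic_coherence(utterances, embed_fn)
    return weights[0] * sentiment + weights[1] * coherence


def correlate_with_humans(metric_per_bot: Sequence[float],
                          human_rating_per_bot: Sequence[float]):
    """Pearson correlation between the hybrid metric and human quality ratings."""
    r, p = pearsonr(metric_per_bot, human_rating_per_bot)
    return r, p
```

In use, each bot would be scored by averaging `hybrid_score` over many self-play trajectories, and `correlate_with_humans` would compare those per-bot scores against interactive human ratings; this is the comparison behind the r > .7, p < .05 correlation quoted from the abstract.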