Approximating Interactive Human Evaluation with Self-Play for Open-Domain Dialog Systems

Authors: Asma Ghandeharioun, Judy Hanwen Shen, Natasha Jaques, Craig Ferguson, Noah Jones, Agata Lapedriza, Rosalind Picard

NeurIPS 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this paper, we investigate interactive human evaluation and provide evidence for its necessity; we then introduce a novel, model-agnostic, and dataset-agnostic method to approximate it. In particular, we propose a self-play scenario where the dialog system talks to itself and we calculate a combination of proxies such as sentiment and semantic coherence on the conversation trajectory. We show that this metric is capable of capturing the human-rated quality of a dialog model better than any automated metric known to-date, achieving a significant Pearson correlation (r > .7, p < .05). To investigate the strengths of this novel metric and interactive evaluation in comparison to state-of-the-art metrics and human evaluation of static conversations, we perform extended experiments with a set of models... (A sketch of this self-play scoring idea appears after the table.)
Researcher Affiliation | Academia | Department of Media Arts and Science, Massachusetts Institute of Technology, Cambridge, MA 02139; {asma_gh,judyshen,jaquesn}@mit.edu, {fergusoc,ncjones,agata}@mit.edu, picard@media.mit.edu
Pseudocode | No | The paper does not contain any clearly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | All the code, data, and interactive evaluation platform resulting from our work are publicly available. The code for all our models is available at https://github.com/natashamjaques/neural_chat and was originally based on [4].
Open Datasets | Yes | A common source of data for open-domain dialog systems is movie scripts, among which the CORNELL dataset [38] is the largest and most commonly used. This REDDIT dataset is available at https://affect.media.mit.edu/neural_chat/datasets.
Dataset Splits | No | The paper describes a 'leave-one-bot-out' scenario for validating its hybrid metric, but does not explicitly provide the training, validation, and test dataset splits for the models themselves (e.g., percentage splits or sample counts). (A sketch of the leave-one-bot-out validation appears after the table.)
Hardware Specification | No | The paper mentions 'providing computing resources' in the acknowledgments, but does not specify any particular hardware details such as GPU models, CPU types, or memory specifications used for the experiments.
Software Dependencies | No | The paper mentions using specific models like a sentiment detector [32] and Infersent [33], and states that the code is available, but does not list specific version numbers for software dependencies (e.g., Python, PyTorch/TensorFlow, or other libraries).
Experiment Setup | Yes | For details regarding hyper-parameter tuning refer to A.12. The hyperparameters were chosen based on empirical results. The values for our models are provided in Table 5.
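
The self-play evaluation quoted in the Research Type row boils down to: let the dialog model converse with itself, then score the resulting trajectory with automated proxies such as sentiment and semantic coherence. Below is a minimal illustrative sketch of that idea, not the authors' implementation; `respond`, `sentiment_score`, and `embed` are hypothetical placeholders for a dialog model, a sentiment detector, and a sentence encoder.

```python
# Minimal sketch of self-play scoring: the bot talks to itself for a fixed
# number of turns, and simple proxies are averaged over the trajectory.
# `respond`, `sentiment_score`, and `embed` are assumed/hypothetical hooks.
import numpy as np

def self_play_scores(respond, sentiment_score, embed, seed="hi", turns=10):
    history = [seed]
    for _ in range(turns):
        history.append(respond(history))  # bot replies to its own conversation so far

    # Proxy 1: average sentiment of the generated utterances.
    sentiment = float(np.mean([sentiment_score(u) for u in history[1:]]))

    # Proxy 2: semantic coherence as cosine similarity of consecutive utterances.
    vecs = [np.asarray(embed(u), dtype=float) for u in history]
    coherence = float(np.mean([
        v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2) + 1e-8)
        for v1, v2 in zip(vecs, vecs[1:])
    ]))
    return {"sentiment": sentiment, "coherence": coherence}
```

The paper combines such proxies into a single hybrid score; the second sketch below shows one way a combination of this kind could be validated against human ratings.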
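
The Dataset Splits row mentions a leave-one-bot-out validation of the hybrid metric. The sketch below is one plausible reading of that procedure, under the assumption that each bot has a vector of automated proxy scores and an interactive human rating: a linear combination of proxies is fit with one bot held out, the held-out bot is predicted, and the predictions are correlated with the human ratings.

```python
# Rough sketch of a leave-one-bot-out check of a hybrid metric (an assumed
# reading of the procedure, not the authors' code): fit proxy weights on all
# bots but one, predict the held-out bot, then correlate with human ratings.
import numpy as np
from scipy.stats import pearsonr

def leave_one_bot_out(proxy_matrix, human_ratings):
    """proxy_matrix: (n_bots, n_proxies) automated scores; human_ratings: (n_bots,)."""
    X = np.asarray(proxy_matrix, dtype=float)
    y = np.asarray(human_ratings, dtype=float)
    preds = []
    for i in range(len(y)):
        mask = np.arange(len(y)) != i
        w, *_ = np.linalg.lstsq(X[mask], y[mask], rcond=None)  # fit on remaining bots
        preds.append(float(X[i] @ w))                           # predict the held-out bot
    r, p = pearsonr(preds, y)  # Pearson correlation with interactive human ratings
    return r, p
```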