Towards an Automatic Turing Test: Learning to Evaluate Dialogue Responses

Authors: Ryan Lowe, Michael Noseworthy, Iulian V. Serban, Nicolas Angelard-Gontier, Yoshua Bengio, Joelle Pineau

Venue: ICLR 2017

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | "We present an evaluation model (ADEM) that learns to predict human-like scores to input responses, using a new dataset of human response scores. We show that the ADEM model's predictions correlate significantly, and at a level much higher than word-overlap metrics such as BLEU, with human judgements at both the utterance and system-level. We also show that ADEM can generalize to evaluating dialogue models unseen during training, an important step for automatic dialogue evaluation." (See the correlation sketch after the table.) |
| Researcher Affiliation | Academia | "Reasoning and Learning Lab, School of Computer Science, McGill University; Montreal Institute for Learning Algorithms, Université de Montréal; CIFAR Senior Fellow" |
| Pseudocode | No | The paper describes the ADEM model and its training mathematically and textually, but does not provide pseudocode or a clearly labeled algorithm block. (See the scoring-function sketch after the table.) |
| Open Source Code | No | "We will provide open-source implementations of the model upon publication." |
| Open Datasets | Yes | "To train a model to predict human scores to dialogue responses, we first collect a dataset of human judgements (scores) of Twitter responses using the crowdsourcing platform Amazon Mechanical Turk (AMT)... we use the Twitter Corpus (Ritter et al., 2011), as such models are pre-trained and readily available." |
| Dataset Splits | Yes | "Table 1: Statistics of the dialogue response evaluation dataset. Each example is in the form (context, model response, reference response, human score)... # Training examples: 2,872; # Validation examples: 616; # Test examples: 616." |
| Hardware Specification | Yes | "We show the evaluation time on the test set for ADEM on both CPU and a Titan X GPU (using Theano, without cuDNN) in Table 11." |
| Software Dependencies | No | The paper names software tools such as Adam and Theano but gives no version numbers for these or any other dependencies, which reproducibility requires. |
| Experiment Setup | Yes | "For training VHRED, we use a context embedding size of 2000. ... Our best ADEM model used γ = 0.02, a = 0.01, and b = 16. For ADEM with tweet2vec embeddings, we did a similar hyperparameter search, and used n = 150, γ = 0.01, a = 0.01, and b = 16. ... we drop words in the decoder with a fixed rate of 25%, and we anneal the KL-divergence term linearly from 0 to 1 over the first 60,000 batches." (See the training-schedule sketch after the table.) |
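The Research Type row cites correlation with human judgements at the utterance and system level. As a point of reference, here is a minimal sketch of the utterance-level check, assuming standard Pearson and Spearman correlations (the paper reports both); the score arrays below are hypothetical placeholders, not the paper's data.

```python
# Sketch: correlate a metric's scores with human scores, utterance by utterance.
from scipy.stats import pearsonr, spearmanr

human_scores  = [4.0, 2.0, 5.0, 1.0, 3.0]   # hypothetical AMT ratings on a 1-5 scale
metric_scores = [3.6, 2.2, 4.8, 1.5, 2.9]   # hypothetical ADEM (or BLEU) scores

pearson_r,  _ = pearsonr(human_scores, metric_scores)
spearman_r, _ = spearmanr(human_scores, metric_scores)
print(f"Pearson r = {pearson_r:.3f}, Spearman rho = {spearman_r:.3f}")
```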
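The Pseudocode row notes that ADEM is specified only in equations. For readers reconstructing it, here is a minimal NumPy sketch of the paper's scoring function, score(c, r, r̂) = (c^T M r̂ + r^T N r̂ − α) / β, trained by minimizing squared error against the human scores with L2 regularization (the γ in the Experiment Setup row is likely that L2 penalty). The encodings, dimensions, and constants below are illustrative stand-ins, not the released model.

```python
import numpy as np

def adem_score(c, r, r_hat, M, N, alpha, beta):
    """score(c, r, r_hat) = (c^T M r_hat + r^T N r_hat - alpha) / beta.

    c, r, r_hat: vector encodings of the context, reference response, and
    model response (the paper encodes these with a pre-trained hierarchical
    RNN). M and N are learned matrices; alpha and beta are scalar constants
    that map the raw score onto the [1, 5] human rating scale.
    """
    return (c @ M @ r_hat + r @ N @ r_hat - alpha) / beta

# Toy usage with random stand-in encodings (dimensions are illustrative).
rng = np.random.default_rng(0)
d = 8
c, r, r_hat = rng.normal(size=(3, d))
M, N = rng.normal(size=(2, d, d))
print(adem_score(c, r, r_hat, M, N, alpha=0.0, beta=1.0))
```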
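The Experiment Setup quote specifies a fixed 25% decoder word-dropout rate and linear annealing of the KL-divergence term over the first 60,000 batches. A minimal sketch of both schedules follows; replacing dropped words with an UNK token is an assumption, since the paper states only the rate.

```python
import numpy as np

def kl_weight(batch_idx, anneal_batches=60_000):
    """Anneal the KL-divergence weight linearly from 0 to 1 over the
    first 60,000 batches, per the quoted experiment setup."""
    return min(1.0, batch_idx / anneal_batches)

def drop_decoder_words(token_ids, unk_id, rate=0.25, rng=None):
    """Drop decoder input words at a fixed 25% rate. Substituting an UNK
    id for dropped words is an assumption, not stated in the paper."""
    rng = rng or np.random.default_rng()
    return [unk_id if rng.random() < rate else t for t in token_ids]

# Toy usage: the KL weight at batch 30,000 is 0.5; on average a quarter
# of the decoder input tokens are replaced.
print(kl_weight(30_000))                          # -> 0.5
print(drop_decoder_words([5, 9, 12, 7, 3], unk_id=0))
```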