reproducibilityindex.ai

Towards an automatic Turing test: Learning to evaluate dialogue responses

Authors: Ryan Lowe, Michael Noseworthy, Iulian V. Serban, Nicolas Angelard-Gontier, Yoshua Bengio, Joelle Pineau

ICLR 2017

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We present an evaluation model (ADEM) that learns to predict human-like scores to input responses, using a new dataset of human response scores. We show that the ADEM model's predictions correlate significantly, and at a level much higher than word-overlap metrics such as BLEU, with human judgements at both the utterance and system-level. We also show that ADEM can generalize to evaluating dialogue models unseen during training, an important step for automatic dialogue evaluation.
Researcher Affiliation	Academia	Reasoning and Learning Lab, School of Computer Science, McGill University; Montreal Institute for Learning Algorithms, Université de Montréal; CIFAR Senior Fellow
Pseudocode	No	The paper describes the ADEM model and its training mathematically and textually, but does not provide pseudocode or a clearly labeled algorithm block.
Open Source Code	No	We will provide open-source implementations of the model upon publication.
Open Datasets	Yes	To train a model to predict human scores to dialogue responses, we first collect a dataset of human judgements (scores) of Twitter responses using the crowdsourcing platform Amazon Mechanical Turk (AMT)... we use the Twitter Corpus (Ritter et al., 2011), as such models are pre-trained and readily available.
Dataset Splits	Yes	Table 1: Statistics of the dialogue response evaluation dataset. Each example is in the form (context, model response, reference response, human score)... # Training examples: 2,872; # Validation examples: 616; # Test examples: 616
Hardware Specification	Yes	We show the evaluation time on the test set for ADEM on both CPU and a Titan X GPU (using Theano, without cuDNN) in Table 11.
Software Dependencies	No	The paper mentions the Theano library and the Adam optimizer but does not provide a version number for Theano or for any other software dependency, which is required for reproducibility.
Experiment Setup	Yes	For training VHRED, we use a context embedding size of 2000. ...Our best ADEM model used γ = 0.02, a = 0.01, and b = 16. For ADEM with tweet2vec embeddings, we did a similar hyperparameter search, and used n = 150, γ = 0.01, a = 0.01, and b = 16. ...we drop words in the decoder with a fixed rate of 25%, and we anneal the KL-divergence term linearly from 0 to 1 over the first 60,000 batches.
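
The two VHRED training details quoted in the Experiment Setup row (decoder word drop at a fixed 25% rate, and linear KL annealing from 0 to 1 over the first 60,000 batches) can be illustrated with a minimal Python sketch. It is not the authors' Theano code. The function names `kl_weight` and `drop_words`, the `<unk>` placeholder, and the choice to replace dropped words rather than delete them are all assumptions made for illustration.

```python
import random


def kl_weight(batch_idx, anneal_batches=60_000):
    """Weight on the KL term: rises linearly from 0 to 1 over the
    first `anneal_batches` batches, then stays at 1."""
    return min(1.0, batch_idx / anneal_batches)


def drop_words(tokens, rate=0.25, unk="<unk>", rng=None):
    """Replace each decoder input token with `unk` independently with
    probability `rate`. Replacing (rather than deleting) the token is
    an assumption; the quoted text only gives the 25% rate."""
    if rng is None:
        rng = random.Random()
    return [unk if rng.random() < rate else tok for tok in tokens]


if __name__ == "__main__":
    # The weight would multiply the KL term of the training loss:
    # loss = reconstruction_loss + kl_weight(step) * kl_divergence
    for step in (0, 15_000, 30_000, 60_000, 90_000):
        print(step, kl_weight(step))
    print(drop_words("see you at the game tonight".split(),
                     rng=random.Random(0)))
```

Passing a seeded `random.Random` makes the word drop repeatable, which helps when comparing runs.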