Towards an automatic Turing test: Learning to evaluate dialogue responses
Authors: Ryan Lowe, Michael Noseworthy, Iulian V. Serban, Nicolas Angelard-Gontier, Yoshua Bengio, Joelle Pineau
ICLR 2017
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We present an evaluation model (ADEM) that learns to predict human-like scores to input responses, using a new dataset of human response scores. We show that the ADEM model's predictions correlate significantly, and at a level much higher than word-overlap metrics such as BLEU, with human judgements at both the utterance and system-level. We also show that ADEM can generalize to evaluating dialogue models unseen during training, an important step for automatic dialogue evaluation. |
| Researcher Affiliation | Academia | Reasoning and Learning Lab, School of Computer Science, McGill University; Montreal Institute for Learning Algorithms, Université de Montréal; CIFAR Senior Fellow |
| Pseudocode | No | The paper describes the ADEM model and its training mathematically and textually, but does not provide pseudocode or a clearly labeled algorithm block. |
| Open Source Code | No | We will provide open-source implementations of the model upon publication. |
| Open Datasets | Yes | To train a model to predict human scores to dialogue responses, we first collect a dataset of human judgements (scores) of Twitter responses using the crowdsourcing platform Amazon Mechanical Turk (AMT)... we use the Twitter Corpus (Ritter et al., 2011), as such models are pre-trained and readily available. |
| Dataset Splits | Yes | Table 1: Statistics of the dialogue response evaluation dataset. Each example is in the form (context, model response, reference response, human score)... # Training examples 2,872 # Validation examples 616 # Test examples 616 |
| Hardware Specification | Yes | We show the evaluation time on the test set for ADEM on both CPU and a Titan X GPU (using Theano, without cuDNN) in Table 11. |
| Software Dependencies | No | The paper mentions software such as Theano and the Adam optimizer, but does not provide specific version numbers for these or any other software dependencies, which is required for reproducibility. |
| Experiment Setup | Yes | For training VHRED, we use a context embedding size of 2000. ...Our best ADEM model used γ = 0.02, a = 0.01, and b = 16. For ADEM with tweet2vec embeddings, we did a similar hyperparameter search, and used n = 150, γ = 0.01, a = 0.01, and b = 16. ...we drop words in the decoder with a fixed rate of 25%, and we anneal the KL-divergence term linearly from 0 to 1 over the first 60,000 batches. |
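The KL-annealing detail quoted in the experiment setup row can be sketched as a simple schedule function. This is a minimal illustrative sketch, not code from the paper's (unreleased) implementation; the function name and batch indexing are assumptions, while the "linear from 0 to 1 over the first 60,000 batches" behavior is taken from the quoted text.

```python
def kl_anneal_weight(batch_index, total_anneal_batches=60_000):
    """Linearly increase the KL-divergence term's weight from 0 to 1
    over the first `total_anneal_batches` batches, then hold it at 1.

    Hypothetical helper illustrating the schedule described in the paper.
    """
    return min(1.0, batch_index / total_anneal_batches)

# Example values along the schedule:
print(kl_anneal_weight(0))        # start of training
print(kl_anneal_weight(30_000))   # halfway through annealing
print(kl_anneal_weight(90_000))   # after annealing completes
```

During training, the returned weight would multiply the KL term of the variational lower bound before it is added to the reconstruction loss for that batch.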