Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Towards an automatic Turing test: Learning to evaluate dialogue responses
Authors: Ryan Lowe, Michael Noseworthy, Iulian V. Serban, Nicolas Angelard-Gontier, Yoshua Bengio, Joelle Pineau
ICLR 2017
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We present an evaluation model (ADEM) that learns to predict human-like scores to input responses, using a new dataset of human response scores. We show that the ADEM model's predictions correlate significantly, and at a level much higher than word-overlap metrics such as BLEU, with human judgements at both the utterance and system-level. We also show that ADEM can generalize to evaluating dialogue models unseen during training, an important step for automatic dialogue evaluation. |
| Researcher Affiliation | Academia | Reasoning and Learning Lab, School of Computer Science, McGill University; Montreal Institute for Learning Algorithms, Université de Montréal; CIFAR Senior Fellow |
| Pseudocode | No | The paper describes the ADEM model and its training mathematically and textually, but does not provide pseudocode or a clearly labeled algorithm block. |
| Open Source Code | No | We will provide open-source implementations of the model upon publication. |
| Open Datasets | Yes | To train a model to predict human scores to dialogue responses, we first collect a dataset of human judgements (scores) of Twitter responses using the crowdsourcing platform Amazon Mechanical Turk (AMT)... we use the Twitter Corpus (Ritter et al., 2011), as such models are pre-trained and readily available. |
| Dataset Splits | Yes | Table 1: Statistics of the dialogue response evaluation dataset. Each example is in the form (context, model response, reference response, human score)... # Training examples: 2,872; # Validation examples: 616; # Test examples: 616 |
| Hardware Specification | Yes | We show the evaluation time on the test set for ADEM on both CPU and a Titan X GPU (using Theano, without cuDNN) in Table 11. |
| Software Dependencies | No | The paper mentions software tools like "Adam" and "Theano" but does not provide specific version numbers for these or any other software dependencies, which is required for reproducibility. |
| Experiment Setup | Yes | For training VHRED, we use a context embedding size of 2000. ...Our best ADEM model used γ = 0.02, a = 0.01, and b = 16. For ADEM with tweet2vec embeddings, we did a similar hyperparameter search, and used n = 150, γ = 0.01, a = 0.01, and b = 16. ...we drop words in the decoder with a fixed rate of 25%, and we anneal the KL-divergence term linearly from 0 to 1 over the first 60,000 batches. |
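
The Experiment Setup row quotes two VHRED training details: a fixed 25% word-drop rate in the decoder and a linear KL-divergence annealing schedule from 0 to 1 over the first 60,000 batches. The following is a minimal Python sketch of what such a schedule could look like, not the authors' implementation. The function names `kl_weight` and `drop_words` are hypothetical, and replacing dropped words with an `<unk>` placeholder is an assumption; the quoted text does not say how dropped words are handled.

```python
import random

# Values quoted in the Experiment Setup row.
KL_ANNEAL_BATCHES = 60_000  # batches over which the KL weight rises from 0 to 1
WORD_DROP_RATE = 0.25       # fixed decoder word-drop rate


def kl_weight(batch_index, anneal_batches=KL_ANNEAL_BATCHES):
    """Linear KL annealing: 0 at batch 0, 1 at `anneal_batches`, then held at 1."""
    return min(1.0, max(0.0, batch_index / anneal_batches))


def drop_words(tokens, rate=WORD_DROP_RATE, unk="<unk>", rng=None):
    """Independently replace each decoder input token with `unk` with probability `rate`.

    Assumption: dropped words become a placeholder token; the source only
    states that words are dropped at a fixed rate.
    """
    rng = rng or random.Random()
    return [unk if rng.random() < rate else tok for tok in tokens]
```

In a training loop, the weight would multiply the KL term of the variational objective, for example `loss = reconstruction_loss + kl_weight(step) * kl_term`, where `reconstruction_loss` and `kl_term` are placeholders for whatever the model computes.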