An Oral Exam for Measuring a Dialog System’s Capabilities

Authors: David Cohen, Ian Lane

AAAI 2016 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We present results from one instantiation of this test being performed on two publicly-accessible dialog systems and a human, and show that the suggested metrics do provide useful insights into the relative strengths and weaknesses of these systems.
Researcher Affiliation | Academia | David Cohen, Carnegie Mellon University, NASA Research Park, Bldg 23, Moffett Field, CA 94035, david.cohen@sv.cmu.edu; Ian Lane, Carnegie Mellon University, NASA Research Park, Bldg 23, Moffett Field, CA 94035, lane@cs.cmu.edu
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks (clearly labeled algorithm sections or code-like formatted procedures).
Open Source Code | No | The paper does not provide concrete access to the source code for its methodology: no repository link, explicit code-release statement, or code in supplementary materials is mentioned.
Open Datasets | No | The paper evaluates two existing publicly accessible dialog systems (Google Now, Cleverbot) and a human. It defines a test domain (Table 2) but provides no concrete access information (link, DOI, repository, or formal citation) for a publicly available or open dataset used for training or evaluation. The paper focuses on evaluating pre-existing systems rather than using or releasing a dataset for model training.
Dataset Splits | No | The paper describes an evaluation process in which human evaluators interact with pre-existing systems, but it does not specify training, validation, and test splits in the conventional machine-learning sense.
Hardware Specification | No | The paper does not provide specific hardware details (exact GPU/CPU models, processor types, memory amounts, or other machine specifications) used to run its experiments.
Software Dependencies | No | The paper does not provide the ancillary software details (e.g., library or solver names with version numbers) needed to replicate the experiment.
Experiment Setup | Yes | We perform evaluations on two computer dialog systems and one human within a small but diverse test domain (shown in Table 2). The two computer dialog systems were Google Now, which uses a speech input interface, and Cleverbot, which uses a typed interface. The human evaluation was performed anonymously over Internet chat, where the test subject did not have access to Internet search. We use six evaluators, who evaluate the test systems in counterbalanced order. Each evaluator performed two elicitation trials per capability per test system. Evaluators performed the test on one system completely before moving on to the next system, and evaluated both programs and the human in counterbalanced order in one 1.5-hour session.
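
The counterbalancing described in the Experiment Setup row (six evaluators, three test systems, two elicitation trials per capability per system, one system completed before the next) can be made concrete with a small scheduling sketch. The snippet below is illustrative only: the system names come from the paper, but the capability labels, function names, and evaluator identifiers are placeholders, since the actual capability list is defined in the paper's Table 2.

```python
from itertools import permutations

# System names from the paper; capability labels are hypothetical placeholders
# standing in for the test domain defined in the paper's Table 2.
SYSTEMS = ["Google Now", "Cleverbot", "Human"]
CAPABILITIES = ["capability_1", "capability_2", "capability_3"]
TRIALS_PER_CAPABILITY = 2
NUM_EVALUATORS = 6

def counterbalanced_orders(systems, num_evaluators):
    """Assign each evaluator one of the 3! = 6 orderings of the three systems,
    so every presentation order is used exactly once across six evaluators."""
    orders = list(permutations(systems))
    assert num_evaluators == len(orders), "six evaluators cover all orderings of three systems"
    return {f"evaluator_{i + 1}": orders[i] for i in range(num_evaluators)}

def session_schedule(order, capabilities, trials_per_capability):
    """One evaluator's session: all trials on a system are finished before
    moving to the next system, with two elicitation trials per capability."""
    schedule = []
    for system in order:
        for capability in capabilities:
            for trial in range(1, trials_per_capability + 1):
                schedule.append((system, capability, trial))
    return schedule

if __name__ == "__main__":
    assignments = counterbalanced_orders(SYSTEMS, NUM_EVALUATORS)
    for evaluator, order in assignments.items():
        print(evaluator, "->", " then ".join(order))
    # Trials per evaluator = 3 systems x number of capabilities x 2 trials.
    trials = session_schedule(assignments["evaluator_1"], CAPABILITIES, TRIALS_PER_CAPABILITY)
    print(len(trials), "trials per evaluator")
```

With three systems there are exactly 3! = 6 orderings, so six evaluators can each receive a distinct ordering; this is one standard way to counterbalance presentation order, though the paper does not state which specific assignment scheme was used.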