A Simple Language Model for Task-Oriented Dialogue
Authors: Ehsan Hosseini-Asl, Bryan McCann, Chien-Sheng Wu, Semih Yavuz, Richard Socher
NeurIPS 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | SimpleTOD improves over the prior state-of-the-art in joint goal accuracy for dialogue state tracking, and our analysis reveals robustness to noisy annotations in this setting. SimpleTOD also improves the main metrics used to evaluate action decisions and response generation in an end-to-end setting: inform rate by 8.1 points, success rate by 9.7 points, and combined score by 7.2 points. |
| Researcher Affiliation | Industry | Ehsan Hosseini-Asl (ehosseiniasl@salesforce.com), Salesforce Research; Bryan McCann (bmccann@salesforce.com), Salesforce Research; Chien-Sheng Wu (wu.jason@salesforce.com), Salesforce Research; Semih Yavuz (syavuz@salesforce.com), Salesforce Research; Richard Socher (rsocher@salesforce.com), Salesforce Research |
| Pseudocode | No | The paper describes algorithms but does not provide structured pseudocode blocks. |
| Open Source Code | Yes | A list of discovered noisy annotations in MultiWOZ 2.1, alongside a cleaned version of the test set and code for training and evaluation, is provided at https://github.com/salesforce/simpletod |
| Open Datasets | Yes | We evaluate on the Multi-domain Wizard-of-Oz (MultiWOZ) [7], a large-scale, multi-domain dialogue dataset of human-human conversations. It contains 10,438 multi-turn dialogues with 13.68 average turns, spanning seven domains (restaurant, train, attraction, hotel, taxi, hospital, police). Police and hospital domains are excluded from evaluation, since they do not have valid/test splits. This leaves 30 domain-slot pairs for the remaining five domains with 4,500 possible values. SimpleTOD is trained on delexicalized system responses according to the pre-processing explained in [7]. Recently, [14] released MultiWOZ 2.1, which removes some noisy state values from dialogue state (belief state) tracking annotations. |
| Dataset Splits | Yes | Police and hospital domains are excluded from evaluation, since they do not have valid/test splits. (A hedged domain-filtering sketch follows the table.) |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU models, CPU models, memory) used for running the experiments. |
| Software Dependencies | No | The input to the model is tokenized with pretrained BPE codes [44] associated with DistilGPT2 [43], a distilled version of GPT-2 [39]. Experiments for SimpleTOD use default hyperparameters for GPT-2 and DistilGPT2 in Huggingface Transformers [52], but no specific library versions are reported. |
| Experiment Setup | No | Experiments for SimpleTOD use default hyperparameters for GPT-2 and DistilGPT2 in Huggingface Transformers [52]. Sequences longer than 1024 tokens are truncated. (A hedged setup sketch follows the table.) |
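
As a concrete illustration of the quoted setup, the following is a minimal sketch, not the authors' released code, of fine-tuning DistilGPT2 through Huggingface Transformers with its default configuration, pretrained BPE tokenization, and truncation of sequences longer than 1024 tokens. The flattened dialogue string, the `<|user|>`/`<|system|>` markers, and the delexicalized `[restaurant_name]` placeholder are hypothetical stand-ins for the paper's actual input format.

```python
# Minimal sketch (assumes transformers and torch are installed); not the
# released SimpleTOD code from https://github.com/salesforce/simpletod.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilgpt2")    # pretrained BPE codes for DistilGPT2
model = AutoModelForCausalLM.from_pretrained("distilgpt2") # default DistilGPT2 hyperparameters

# Hypothetical flattened training example with a delexicalized system response;
# the paper's actual special tokens and formatting may differ.
example = (
    "<|user|> i need a cheap restaurant in the centre "
    "<|system|> [restaurant_name] is a cheap restaurant in the centre ."
)

# Sequences longer than 1024 tokens are truncated.
inputs = tokenizer(example, truncation=True, max_length=1024, return_tensors="pt")

# Standard causal language-modeling loss over the whole flattened sequence.
outputs = model(**inputs, labels=inputs["input_ids"])
print(float(outputs.loss))
```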
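
For the dataset-splits row, a rough sketch of excluding the police and hospital domains is below; the `MultiWOZ_2.1/data.json` path and the per-dialogue `goal` structure are assumptions about the raw MultiWOZ release, not details taken from the paper.

```python
import json

# Assumed layout: one JSON file mapping dialogue ids to records whose "goal"
# field has one entry per domain (empty when the domain is inactive).
with open("MultiWOZ_2.1/data.json") as f:
    dialogues = json.load(f)

EXCLUDED = {"police", "hospital"}  # no valid/test splits for these domains

def active_domains(dialogue):
    # Treat a domain as active if its goal entry is a non-empty dict (assumed).
    return {name for name, goal in dialogue.get("goal", {}).items()
            if isinstance(goal, dict) and goal}

kept = {did: d for did, d in dialogues.items()
        if not (active_domains(d) & EXCLUDED)}
print(f"kept {len(kept)} of {len(dialogues)} dialogues")
```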