A Simple Language Model for Task-Oriented Dialogue
Authors: Ehsan Hosseini-Asl, Bryan McCann, Chien-Sheng Wu, Semih Yavuz, Richard Socher
NeurIPS 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | SimpleTOD improves over the prior state-of-the-art in joint goal accuracy for dialogue state tracking, and our analysis reveals robustness to noisy annotations in this setting. SimpleTOD also improves the main metrics used to evaluate action decisions and response generation in an end-to-end setting: inform rate by 8.1 points, success rate by 9.7 points, and combined score by 7.2 points. |
| Researcher Affiliation | Industry | Ehsan Hosseini-Asl (ehosseiniasl@salesforce.com), Salesforce Research; Bryan McCann (bmccann@salesforce.com), Salesforce Research; Chien-Sheng Wu (wu.jason@salesforce.com), Salesforce Research; Semih Yavuz (syavuz@salesforce.com), Salesforce Research; Richard Socher (rsocher@salesforce.com), Salesforce Research |
| Pseudocode | No | The paper describes algorithms but does not provide structured pseudocode blocks. |
| Open Source Code | Yes | A list of discovered noisy annotations in MultiWOZ 2.1, alongside a cleaned version of the test set and code for training and evaluation, is provided at https://github.com/salesforce/simpletod |
| Open Datasets | Yes | We evaluate on the Multi-domain Wizard-of-Oz (MultiWOZ) [7], a large-scale, multi-domain dialogue dataset of human-human conversations. It contains 10,438 multi-turn dialogues with 13.68 average turns, spanning seven domains (restaurant, train, attraction, hotel, taxi, hospital, police). Police and hospital domains are excluded from evaluation, since they do not have valid/test splits. This leaves 30 domain-slot pairs for the remaining five domains with 4,500 possible values. SimpleTOD is trained on delexicalized system responses according to the pre-processing explained in [7]. Recently, [14] released MultiWOZ 2.1, which removes some noisy state values from dialogue state (belief state) tracking annotations. |
| Dataset Splits | Yes | Police and hospital domains are excluded from evaluation, since they do not have valid/test splits. (A hedged domain-filtering sketch follows the table.) |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU models, CPU models, memory) used for running the experiments. |
| Software Dependencies | No | The input to the model is tokenized with pretrained BPE codes [44] associated with DistilGPT2 [43], a distilled version of GPT-2 [39]. Experiments for SimpleTOD use default hyperparameters for GPT-2 and DistilGPT2 in Huggingface Transformers [52], but no specific library versions are reported. |
| Experiment Setup | No | Experiments for SimpleTOD use default hyperparameters for GPT-2 and DistilGPT2 in Huggingface Transformers [52]. Sequences longer than 1024 tokens are truncated. (A hedged setup sketch follows the table.) |
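
As a concrete illustration of the quoted setup, the following is a minimal sketch, not the authors' released code, of fine-tuning DistilGPT2 through Huggingface Transformers with its default configuration, pretrained BPE tokenization, and truncation of sequences longer than 1024 tokens. The flattened dialogue string, the `<|user|>`/`<|system|>` markers, and the delexicalized `[restaurant_name]` placeholder are hypothetical stand-ins for the paper's actual input format.

```python
# Minimal sketch (assumes transformers and torch are installed); not the
# released SimpleTOD code from https://github.com/salesforce/simpletod.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilgpt2")    # pretrained BPE codes for DistilGPT2
model = AutoModelForCausalLM.from_pretrained("distilgpt2") # default DistilGPT2 hyperparameters

# Hypothetical flattened training example with a delexicalized system response;
# the paper's actual special tokens and formatting may differ.
example = (
    "<|user|> i need a cheap restaurant in the centre "
    "<|system|> [restaurant_name] is a cheap restaurant in the centre ."
)

# Sequences longer than 1024 tokens are truncated.
inputs = tokenizer(example, truncation=True, max_length=1024, return_tensors="pt")

# Standard causal language-modeling loss over the whole flattened sequence.
outputs = model(**inputs, labels=inputs["input_ids"])
print(float(outputs.loss))
```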
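
For the dataset-splits row, a rough sketch of excluding the police and hospital domains is below; the `MultiWOZ_2.1/data.json` path and the per-dialogue `goal` structure are assumptions about the raw MultiWOZ release, not details taken from the paper.

```python
import json

# Assumed layout: one JSON file mapping dialogue ids to records whose "goal"
# field has one entry per domain (empty when the domain is inactive).
with open("MultiWOZ_2.1/data.json") as f:
    dialogues = json.load(f)

EXCLUDED = {"police", "hospital"}  # no valid/test splits for these domains

def active_domains(dialogue):
    # Treat a domain as active if its goal entry is a non-empty dict (assumed).
    return {name for name, goal in dialogue.get("goal", {}).items()
            if isinstance(goal, dict) and goal}

kept = {did: d for did, d in dialogues.items()
        if not (active_domains(d) & EXCLUDED)}
print(f"kept {len(kept)} of {len(dialogues)} dialogues")
```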