UBAR: Towards Fully End-to-End Task-Oriented Dialog System with GPT-2

Authors: Yunyi Yang, Yunhao Li, Xiaojun Quan

AAAI 2021, pp. 14230-14238 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results on the MultiWOZ datasets show that UBAR achieves state-of-the-art performances in multiple settings, improving the combined score of response generation, policy optimization, and end-to-end modeling by 4.7, 3.5, and 9.4 points respectively. (The combined-score convention is sketched after this table.)
Researcher Affiliation | Academia | Yunyi Yang, Yunhao Li, Xiaojun Quan* Sun Yat-sen University {yangyy37, liyh355}@mail2.sysu.edu.cn, quanxj3@mail.sysu.edu.cn
Pseudocode | No | No explicitly labeled pseudocode or algorithm blocks were found.
Open Source Code | Yes | Code and technical appendix available at https://github.com/TonyNemo/UBAR-MultiWOZ
Open Datasets | Yes | MultiWOZ 2.0 (Budzianowski et al. 2018) is a large-scale human-to-human multi-domain task-oriented dialog dataset consisting of 8438 dialogues spanning seven domains (attraction, hospital, police, hotel, restaurant, taxi, train). It provides an additional validation set and test set of 1000 dialogues each, excluding hospital and police.
Dataset Splits | Yes | The dataset consists of 8438 training dialogues, with an additional validation set and test set of 1000 dialogues each, excluding hospital and police. (Split sizes are summarized in a sketch after this table.)
Hardware Specification | No | No specific hardware details such as CPU/GPU models, memory, or cloud instance types used for running experiments were found.
Software Dependencies | No | The paper mentions "Hugging Face's Transformers (Wolf et al. 2019) and DistilGPT2 (Sanh et al. 2019)" but does not provide specific version numbers for these software components. (A version-recording sketch follows this table.)
Experiment Setup | Yes | The model is trained on session-level sequences with a max sequence length of 1024. Sequences that exceed 1024 tokens are pre-truncated. We use the AdamW optimizer and the standard greedy decoding method with a temperature of 0.7. We select the best-performing model on the validation set through a hyperparameter search over learning rate and batch size, then evaluate on the test set to get the final results. (A configuration sketch follows this table.)
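
The combined score referenced in the Research Type row is not defined in the quoted evidence; the helper below is a minimal sketch of the standard MultiWOZ convention, Combined = (Inform + Success) * 0.5 + BLEU, which this line of work typically follows. The function name and example numbers are illustrative placeholders, not values taken from the paper.

```python
def combined_score(inform: float, success: float, bleu: float) -> float:
    """MultiWOZ combined score as conventionally defined:
    Combined = (Inform + Success) * 0.5 + BLEU.
    Illustrative helper; not taken from the UBAR paper or its code."""
    return 0.5 * (inform + success) + bleu

# Placeholder values only: 0.5 * (90.0 + 80.0) + 17.0 = 102.0
print(combined_score(90.0, 80.0, 17.0))
```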
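For quick reference, the domains and split sizes quoted in the Open Datasets and Dataset Splits rows can be summarized as follows; the dictionary layout is illustrative only.

```python
# MultiWOZ 2.0 statistics as quoted above (layout is illustrative).
MULTIWOZ_DOMAINS = [
    "attraction", "hospital", "police", "hotel", "restaurant", "taxi", "train",
]
MULTIWOZ_SPLITS = {
    "train": 8438,       # dialogues
    "validation": 1000,  # excludes hospital and police
    "test": 1000,        # excludes hospital and police
}
```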
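Because the paper names Hugging Face's Transformers and DistilGPT2 but gives no version numbers, one way to load the backbone and record the environment is sketched below. The package and model names are real; the version-logging step is a suggestion of mine, not something the paper describes.

```python
# Minimal sketch: load the DistilGPT2 backbone named in the paper and
# record the library versions that the paper leaves unspecified.
import torch
import transformers
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("distilgpt2")
model = GPT2LMHeadModel.from_pretrained("distilgpt2")

# Log versions so the run can be reproduced later (not done in the paper).
print("transformers", transformers.__version__)
print("torch", torch.__version__)
```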
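The Experiment Setup row reads as a fairly standard causal-LM fine-tuning recipe. The sketch below mirrors it under stated assumptions: the learning rate, batch handling, and truncation side are placeholders (the paper tunes learning rate and batch size on the validation set), and the decoding call uses plain greedy search, since a temperature has no effect on argmax decoding.

```python
# Sketch of the quoted setup: DistilGPT2 fine-tuned on session-level
# sequences capped at 1024 tokens, AdamW, greedy decoding.
# Learning rate and truncation side are assumptions, not values from the paper.
import torch
from torch.optim import AdamW
from transformers import GPT2LMHeadModel, GPT2Tokenizer

MAX_LEN = 1024
tokenizer = GPT2Tokenizer.from_pretrained("distilgpt2")
model = GPT2LMHeadModel.from_pretrained("distilgpt2")
optimizer = AdamW(model.parameters(), lr=1e-4)  # placeholder learning rate

def train_step(session_text: str) -> float:
    """One language-modeling step on a session-level sequence."""
    enc = tokenizer(session_text, truncation=True, max_length=MAX_LEN,
                    return_tensors="pt")
    out = model(**enc, labels=enc["input_ids"])  # shifted LM loss
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return out.loss.item()

@torch.no_grad()
def generate(context_text: str) -> str:
    """Greedily decode the continuation of a dialog context."""
    enc = tokenizer(context_text, return_tensors="pt")
    ids = model.generate(enc["input_ids"], max_length=MAX_LEN,
                         do_sample=False)  # greedy; temperature is a no-op here
    return tokenizer.decode(ids[0][enc["input_ids"].shape[1]:])
```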