UBAR: Towards Fully End-to-End Task-Oriented Dialog System with GPT-2

Authors: Yunyi Yang, Yunhao Li, Xiaojun Quan

AAAI 2021, pp. 14230-14238 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results on the MultiWOZ datasets show that UBAR achieves state-of-the-art performances in multiple settings, improving the combined score of response generation, policy optimization, and end-to-end modeling by 4.7, 3.5, and 9.4 points respectively. (The combined-score convention is sketched after this table.)
Researcher Affiliation | Academia | Yunyi Yang, Yunhao Li, Xiaojun Quan* Sun Yat-sen University {yangyy37, liyh355}@mail2.sysu.edu.cn, quanxj3@mail.sysu.edu.cn
Pseudocode | No | No explicitly labeled pseudocode or algorithm blocks were found.
Open Source Code | Yes | Code and technical appendix available at https://github.com/TonyNemo/UBAR-MultiWOZ
Open Datasets | Yes | MultiWOZ 2.0 (Budzianowski et al. 2018) is a large-scale human-to-human multi-domain task-oriented dialog dataset consisting of 8438 dialogues spanning seven domains (attraction, hospital, police, hotel, restaurant, taxi, train). It provides an additional validation set and test set of 1000 dialogues each, excluding hospital and police.
Dataset Splits | Yes | The dataset consists of 8438 training dialogues, with an additional validation set and test set of 1000 dialogues each, excluding hospital and police. (Split sizes are summarized in a sketch after this table.)
Hardware Specification | No | No specific hardware details such as CPU/GPU models, memory, or cloud instance types used for running experiments were found.
Software Dependencies | No | The paper mentions "Hugging Face's Transformers (Wolf et al. 2019) and DistilGPT2 (Sanh et al. 2019)" but does not provide specific version numbers for these software components. (A version-recording sketch follows this table.)
Experiment Setup | Yes | The model is trained on session-level sequences with a max sequence length of 1024. Sequences that exceed 1024 tokens are pre-truncated. We use the AdamW optimizer and the standard greedy decoding method with a temperature of 0.7. We select the best-performing model on the validation set through a hyperparameter search over learning rate and batch size, then evaluate on the test set to get the final results. (A configuration sketch follows this table.)
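
The combined score referenced in the Research Type row is not defined in the quoted evidence; the helper below is a minimal sketch of the standard MultiWOZ convention, Combined = (Inform + Success) * 0.5 + BLEU, which this line of work typically follows. The function name and example numbers are illustrative placeholders, not values taken from the paper.

```python
def combined_score(inform: float, success: float, bleu: float) -> float:
    """MultiWOZ combined score as conventionally defined:
    Combined = (Inform + Success) * 0.5 + BLEU.
    Illustrative helper; not taken from the UBAR paper or its code."""
    return 0.5 * (inform + success) + bleu

# Placeholder values only: 0.5 * (90.0 + 80.0) + 17.0 = 102.0
print(combined_score(90.0, 80.0, 17.0))
```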
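For quick reference, the domains and split sizes quoted in the Open Datasets and Dataset Splits rows can be summarized as follows; the dictionary layout is illustrative only.

```python
# MultiWOZ 2.0 statistics as quoted above (layout is illustrative).
MULTIWOZ_DOMAINS = [
    "attraction", "hospital", "police", "hotel", "restaurant", "taxi", "train",
]
MULTIWOZ_SPLITS = {
    "train": 8438,       # dialogues
    "validation": 1000,  # excludes hospital and police
    "test": 1000,        # excludes hospital and police
}
```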
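Because the paper names Hugging Face's Transformers and DistilGPT2 but gives no version numbers, one way to load the backbone and record the environment is sketched below. The package and model names are real; the version-logging step is a suggestion of mine, not something the paper describes.

```python
# Minimal sketch: load the DistilGPT2 backbone named in the paper and
# record the library versions that the paper leaves unspecified.
import torch
import transformers
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("distilgpt2")
model = GPT2LMHeadModel.from_pretrained("distilgpt2")

# Log versions so the run can be reproduced later (not done in the paper).
print("transformers", transformers.__version__)
print("torch", torch.__version__)
```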
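The Experiment Setup row reads as a fairly standard causal-LM fine-tuning recipe. The sketch below mirrors it under stated assumptions: the learning rate, batch handling, and truncation side are placeholders (the paper tunes learning rate and batch size on the validation set), and the decoding call uses plain greedy search, since a temperature has no effect on argmax decoding.

```python
# Sketch of the quoted setup: DistilGPT2 fine-tuned on session-level
# sequences capped at 1024 tokens, AdamW, greedy decoding.
# Learning rate and truncation side are assumptions, not values from the paper.
import torch
from torch.optim import AdamW
from transformers import GPT2LMHeadModel, GPT2Tokenizer

MAX_LEN = 1024
tokenizer = GPT2Tokenizer.from_pretrained("distilgpt2")
model = GPT2LMHeadModel.from_pretrained("distilgpt2")
optimizer = AdamW(model.parameters(), lr=1e-4)  # placeholder learning rate

def train_step(session_text: str) -> float:
    """One language-modeling step on a session-level sequence."""
    enc = tokenizer(session_text, truncation=True, max_length=MAX_LEN,
                    return_tensors="pt")
    out = model(**enc, labels=enc["input_ids"])  # shifted LM loss
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return out.loss.item()

@torch.no_grad()
def generate(context_text: str) -> str:
    """Greedily decode the continuation of a dialog context."""
    enc = tokenizer(context_text, return_tensors="pt")
    ids = model.generate(enc["input_ids"], max_length=MAX_LEN,
                         do_sample=False)  # greedy; temperature is a no-op here
    return tokenizer.decode(ids[0][enc["input_ids"].shape[1]:])
```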