DialogBERT: Discourse-Aware Response Generation via Learning to Recover and Rank Utterances

Authors: Xiaodong Gu, Kang Min Yoo, Jung-Woo Ha (pp. 12911-12919)

AAAI 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments on three multi-turn conversation datasets show that our approach remarkably outperforms the baselines, such as BART and DialoGPT, in terms of quantitative evaluation. The human evaluation suggests that DialogBERT generates more coherent, informative, and human-like responses than the baselines with significant margins.
Researcher Affiliation | Collaboration | Xiaodong Gu¹, Kang Min Yoo² and Jung-Woo Ha²; ¹School of Software, Shanghai Jiao Tong University, China; ²NAVER AI Lab, Korea
Pseudocode | No | The paper describes its methods textually and with diagrams but does not include any pseudocode or algorithm blocks.
Open Source Code | No | The paper states: 'We implemented our approach on top of the Huggingface Transformer repository (Wolf et al. 2019a)', but it does not provide a link to its own source code for the described methodology.
Open Datasets | Yes | Weibo is a large-scale multi-turn conversation benchmark, introduced in the NLPCC2018 task 5 [1]. The dataset originates from the Sina Weibo (microblog)... MultiWOZ [2] is a dataset... (Budzianowski et al. 2018). DailyDialog [3] is a popular dataset... (Shen et al. 2018; Gu et al. 2019). Footnotes provide URLs: [1] http://tcci.ccf.org.cn/conference/2018/dldoc/taskgline05.pdf [2] https://github.com/budzianowski/multiwoz [3] http://yanran.li/dailydialog
Dataset Splits | Yes | Weibo (train samples 15,481,891, valid samples 89,994, test samples 84,052); MultiWOZ (train samples 106,794, valid samples 12,902, test samples 12,914); DailyDialog (train samples 76,052, valid samples 7,069, test samples 6,740).
Hardware Specification | Yes | Experiments took place on a machine with Ubuntu 16.04 and an NVIDIA Tesla P40 GPU.
Software Dependencies | No | The paper states: 'We implemented our approach on top of the Huggingface Transformer repository (Wolf et al. 2019a).' However, it does not specify a version number for the Huggingface Transformer library or any other software dependencies.
Experiment Setup | Yes | We limit the number of utterances in each context to 7 (Adiwardana et al. 2020) and the utterance length to 30 words. All of the experiments use the default BERT tokenizer (e.g., bert-base-uncased for English datasets). All models were optimized with the AdamW optimizer (Loshchilov and Hutter 2018) using an initial learning rate of 5e-5. We used the adaptive learning-rate scheduler with 5,000 warmup steps.
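
The experiment-setup row above translates into a small amount of configuration code. The following is a minimal sketch, assuming PyTorch and the Hugging Face Transformers library: only the reported hyperparameters (7-utterance context, 30-word utterance cap, bert-base-uncased tokenizer, AdamW with an initial learning rate of 5e-5, 5,000 warmup steps) come from the paper, while the function names, the choice of a linear-warmup scheduler, and the token-level truncation are illustrative assumptions, not the authors' implementation.

```python
# Sketch of the reported training setup (hyperparameters from the paper;
# everything else is an assumption for illustration).
from torch.optim import AdamW
from transformers import BertTokenizer, get_linear_schedule_with_warmup

MAX_CONTEXT_UTTERANCES = 7   # context limited to 7 utterances (Adiwardana et al. 2020)
MAX_UTTERANCE_LENGTH = 30    # paper caps utterances at 30 words; truncating to
                             # 30 subword tokens here is only an approximation

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

def encode_context(utterances):
    """Keep the most recent 7 utterances and truncate each one when encoding."""
    recent = utterances[-MAX_CONTEXT_UTTERANCES:]
    return [
        tokenizer.encode(u, truncation=True, max_length=MAX_UTTERANCE_LENGTH)
        for u in recent
    ]

def build_optimizer(model, num_training_steps):
    """AdamW with initial lr 5e-5 and a 5,000-step warmup schedule (warmup type assumed)."""
    optimizer = AdamW(model.parameters(), lr=5e-5)
    scheduler = get_linear_schedule_with_warmup(
        optimizer, num_warmup_steps=5_000, num_training_steps=num_training_steps
    )
    return optimizer, scheduler
```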