DialogBERT: Discourse-Aware Response Generation via Learning to Recover and Rank Utterances

Authors: Xiaodong Gu, Kang Min Yoo, Jung-Woo Ha (pp. 12911-12919)

AAAI 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments on three multi-turn conversation datasets show that our approach remarkably outperforms the baselines, such as BART and DialoGPT, in terms of quantitative evaluation. The human evaluation suggests that DialogBERT generates more coherent, informative, and human-like responses than the baselines with significant margins.
Researcher Affiliation | Collaboration | Xiaodong Gu¹, Kang Min Yoo² and Jung-Woo Ha²; ¹School of Software, Shanghai Jiao Tong University, China; ²NAVER AI Lab, Korea
Pseudocode | No | The paper describes its methods textually and with diagrams but does not include any pseudocode or algorithm blocks.
Open Source Code | No | The paper states: 'We implemented our approach on top of the Huggingface Transformer repository (Wolf et al. 2019a)', but it does not provide a link to its own source code for the described methodology.
Open Datasets | Yes | Weibo is a large-scale multi-turn conversation benchmark, introduced in the NLPCC2018 task 5 [1]. The dataset originates from the Sina Weibo (microblog)... MultiWOZ [2] is a dataset... (Budzianowski et al. 2018). DailyDialog [3] is a popular dataset... (Shen et al. 2018; Gu et al. 2019). Footnotes provide URLs: [1] http://tcci.ccf.org.cn/conference/2018/dldoc/taskgline05.pdf [2] https://github.com/budzianowski/multiwoz [3] http://yanran.li/dailydialog
Dataset Splits | Yes | Weibo (train samples 15,481,891, valid samples 89,994, test samples 84,052); MultiWOZ (train samples 106,794, valid samples 12,902, test samples 12,914); DailyDialog (train samples 76,052, valid samples 7,069, test samples 6,740).
Hardware Specification | Yes | Experiments took place on a machine with Ubuntu 16.04 and an NVIDIA Tesla P40 GPU.
Software Dependencies | No | The paper states: 'We implemented our approach on top of the Huggingface Transformer repository (Wolf et al. 2019a).' However, it does not specify a version number for the Huggingface Transformer library or any other software dependencies.
Experiment Setup | Yes | We limit the number of utterances in each context to 7 (Adiwardana et al. 2020) and the utterance length to 30 words. All of the experiments use the default BERT tokenizer (e.g., bert-base-uncased for English datasets). All models were optimized with the AdamW optimizer (Loshchilov and Hutter 2018) using an initial learning rate of 5e-5. We used the adaptive learning-rate scheduler with 5,000 warmup steps.
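
The experiment-setup row above translates into a small amount of configuration code. The following is a minimal sketch, assuming PyTorch and the Hugging Face Transformers library: only the reported hyperparameters (7-utterance context, 30-word utterance cap, bert-base-uncased tokenizer, AdamW with an initial learning rate of 5e-5, 5,000 warmup steps) come from the paper, while the function names, the choice of a linear-warmup scheduler, and the token-level truncation are illustrative assumptions, not the authors' implementation.

```python
# Sketch of the reported training setup (hyperparameters from the paper;
# everything else is an assumption for illustration).
from torch.optim import AdamW
from transformers import BertTokenizer, get_linear_schedule_with_warmup

MAX_CONTEXT_UTTERANCES = 7   # context limited to 7 utterances (Adiwardana et al. 2020)
MAX_UTTERANCE_LENGTH = 30    # paper caps utterances at 30 words; truncating to
                             # 30 subword tokens here is only an approximation

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

def encode_context(utterances):
    """Keep the most recent 7 utterances and truncate each one when encoding."""
    recent = utterances[-MAX_CONTEXT_UTTERANCES:]
    return [
        tokenizer.encode(u, truncation=True, max_length=MAX_UTTERANCE_LENGTH)
        for u in recent
    ]

def build_optimizer(model, num_training_steps):
    """AdamW with initial lr 5e-5 and a 5,000-step warmup schedule (warmup type assumed)."""
    optimizer = AdamW(model.parameters(), lr=5e-5)
    scheduler = get_linear_schedule_with_warmup(
        optimizer, num_warmup_steps=5_000, num_training_steps=num_training_steps
    )
    return optimizer, scheduler
```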