Zero-Resource Knowledge-Grounded Dialogue Generation

Authors: Linxiao Li, Can Xu, Wei Wu, Yufan Zhao, Xueliang Zhao, Chongyang Tao

NeurIPS 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct experiments with benchmarks of knowledge-grounded dialogue generation that are constructed by crowd-sourcing. Evaluation results in terms of both automatic metrics and human judgment indicate that our model not only achieves comparable performance with the state-of-the-art model that is learned from crowd-sourced training sets, but also exhibits a good generalization ability over different topics and different datasets.
Researcher Affiliation | Collaboration | Linxiao Li (Peking University, lilinxiao@pku.edu.cn); Can Xu (Microsoft STCA, caxu@microsoft.com); Wei Wu (Meituan, wuwei19850318@gmail.com); Yufan Zhao (Microsoft STCA, yufzhao@microsoft.com); Xueliang Zhao (Peking University, xl.zhao@pku.edu.cn); Chongyang Tao (Peking University, chongyangtao@pku.edu.cn)
Pseudocode | Yes | Algorithm 1: Optimization Algorithm
Open Source Code | Yes | Dataset and codes are publicly available at https://github.com/nlpxucan/ZRKGC
Open Datasets | Yes | We test the proposed method on benchmarks of knowledge-grounded dialogue generation, including Wizard of Wikipedia (Wizard) [10], Topical-Chat (TC) [16], and CMU Document Grounded Conversations (CMU_DoG) [55]. We build the knowledge corpus with a Wikipedia dump, where text is extracted with an open source tool and split into sentences using NLTK. The dialogue corpus is constructed from the Reddit Conversation Corpus cleaned by [12]. For CMU_DoG, we use the version shared at https://github.com/lizekang/ITDD. For TC, we utilize the data published in the open source project https://github.com/alexa/alexa-prize-topical-chat-dataset/. (See the corpus-preprocessing sketch below the table.)
Dataset Splits | Yes | After the pre-processing, the subset is randomly split into a training set and a validation set with 842,521 and 2,737 dialogues respectively.
Hardware Specification | No | The paper does not specify the hardware (e.g., GPU/CPU models, memory) used to run the experiments.
Software Dependencies | No | We index the sentences in the knowledge corpus with an open source Lucene.Net, employ the internal ranker of Lucene (basically a BM25 model [31]) as rel(·,·), and set the number of retrieved candidates (i.e., l) as 10. The function Sim(·,·) in Section 2.2 is defined as Bleu-2 [27]. We choose UniLM Base (110M) and implement the model with the code in https://github.com/microsoft/unilm. We find that replacing D_KL(q(Z_α) || p(Z_α | C, Z_k)) in Eq. 3 with a mean squared error in optimization can enhance model performance, probably because Z_α is a continuous variable. The model is trained with a batch size 10, a maximum input length 256, and a maximum output length 40. The threshold λ and the maximum step M in Algorithm 1 are set as 0.2 and 100,000 respectively. The learning rate is set as 0.00003 and the warmup step is set as 1000. (See the retrieval and Bleu-2 similarity sketch below the table.)
Experiment Setup | Yes | The model is trained with a batch size 10, a maximum input length 256, and a maximum output length 40. The threshold λ and the maximum step M in Algorithm 1 are set as 0.2 and 100,000 respectively. The learning rate is set as 0.00003 and the warmup step is set as 1000. In training, we evaluate the model every 5,000 steps on the validation set with unigram F1 [10] as a metric. The training procedure is terminated once F1 begins to drop. To draw a fair comparison, we keep the same evaluation procedure as the existing models. At test time, we use beam search with a beam size of 5. (See the unigram-F1 early-stopping sketch below the table.)
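The Open Datasets row above describes building the knowledge corpus by extracting text from a Wikipedia dump and splitting it into sentences with NLTK. The following is a minimal sketch of that preprocessing step, assuming the Wikipedia text has already been extracted to plain-text files; the file layout and the function name are illustrative and not taken from the released ZRKGC code.

import nltk
from nltk.tokenize import sent_tokenize

nltk.download("punkt", quiet=True)  # sentence-splitting model used by sent_tokenize

def build_knowledge_corpus(wiki_text_files):
    # Split extracted Wikipedia articles into individual knowledge sentences.
    sentences = []
    for path in wiki_text_files:
        with open(path, encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if line:
                    sentences.extend(sent_tokenize(line))
    return sentences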
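The Software Dependencies row specifies BM25 retrieval over the knowledge corpus (l = 10 candidates) and Bleu-2 as the similarity function Sim(·,·). The paper indexes the corpus with Lucene.Net and uses Lucene's internal ranker; the sketch below substitutes the rank_bm25 package and NLTK's BLEU implementation, so it illustrates the quoted configuration rather than reproducing the authors' exact retrieval stack.

from rank_bm25 import BM25Okapi
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

NUM_CANDIDATES = 10  # l in the paper

def build_retriever(knowledge_sentences):
    # Index the knowledge sentences for BM25 ranking (stand-in for Lucene).
    tokenized = [s.lower().split() for s in knowledge_sentences]
    return BM25Okapi(tokenized)

def retrieve(query, bm25, knowledge_sentences, l=NUM_CANDIDATES):
    # rel(·,·): rank knowledge sentences against the dialogue context with BM25.
    return bm25.get_top_n(query.lower().split(), knowledge_sentences, n=l)

def sim(knowledge_sentence, response):
    # Sim(·,·): Bleu-2 overlap between a knowledge candidate and a response.
    return sentence_bleu(
        [response.lower().split()],
        knowledge_sentence.lower().split(),
        weights=(0.5, 0.5),
        smoothing_function=SmoothingFunction().method1,
    )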
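The Experiment Setup row evaluates the model every 5,000 steps with unigram F1 and stops training once F1 begins to drop. Below is a minimal sketch of that check, assuming simple whitespace tokenization; the authors' evaluation script may normalize text differently.

from collections import Counter

def unigram_f1(hypothesis, reference):
    # Harmonic mean of unigram precision and recall between a generated
    # response and the reference response.
    hyp, ref = hypothesis.lower().split(), reference.lower().split()
    overlap = sum((Counter(hyp) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(hyp), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

def should_stop(f1_history):
    # Terminate training once validation F1 starts to drop.
    return len(f1_history) >= 2 and f1_history[-1] < f1_history[-2]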