Wizard of Wikipedia: Knowledge-Powered Conversational Agents

Authors: Emily Dinan, Stephen Roller, Kurt Shuster, Angela Fan, Michael Auli, Jason Weston

ICLR 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To that end we collect and release a large dataset with conversations directly grounded with knowledge retrieved from Wikipedia. We then design architectures capable of retrieving knowledge, reading and conditioning on it, and finally generating natural responses. Our best performing dialogue models are able to conduct knowledgeable discussions on open-domain topics as evaluated by automatic metrics and human evaluations, while our new benchmark allows for measuring further improvements in this important research direction.
Researcher Affiliation | Industry | Emily Dinan, Stephen Roller, Kurt Shuster, Angela Fan, Michael Auli, Jason Weston, Facebook AI Research, {edinan,roller,kshuster,angelafan,michaelauli,jase}@fb.com
Pseudocode | No | The paper does not contain any pseudocode or clearly labeled algorithm blocks.
Open Source Code | Yes | Our new benchmark, publicly in ParlAI (http://parl.ai/projects/wizard_of_wikipedia/), aims to encourage and measure further improvements in this important research direction.
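
As a quick, hedged illustration (not part of the paper), the released benchmark can typically be downloaded and browsed through ParlAI's data-display utility. The DisplayData entry point and the task name wizard_of_wikipedia below are assumptions about the current ParlAI release and may vary by version:

# Hedged sketch: print a few Wizard of Wikipedia dialogues via ParlAI.
# Assumes `pip install parlai`; the entry point and task name may differ
# across ParlAI versions.
from parlai.scripts.display_data import DisplayData

if __name__ == "__main__":
    # Downloads the dataset on first use, then prints sample episodes.
    DisplayData.main(task="wizard_of_wikipedia", num_examples=5)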
Open Datasets | Yes | The final dialogue dataset we collect consists of 22,311 dialogues with 201,999 turns, which we divide into 166,787 for train, 17,715 for validation, and 17,497 for test. The test set is split into two subsets, Test Seen and Test Unseen. Test Seen contains 533 overlapping topics with the training set with new dialogues about those topics. Test Unseen consists of 58 topics never seen before in train or validation. Overall data statistics can be found in Table 1, and further statistics and examples of collected conversations in Appendix A.2.
Dataset Splits | Yes | The final dialogue dataset we collect consists of 22,311 dialogues with 201,999 turns, which we divide into 166,787 for train, 17,715 for validation, and 17,497 for test.
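
Note that the split figures quoted above are turn counts (the 22,311 figure counts dialogues). As a small illustrative check, the per-split turn counts sum exactly to the reported total:

# Illustrative sanity check of the published split statistics (in turns).
SPLIT_TURNS = {"train": 166_787, "valid": 17_715, "test": 17_497}
TOTAL_TURNS = 201_999

assert sum(SPLIT_TURNS.values()) == TOTAL_TURNS
print({name: f"{n / TOTAL_TURNS:.1%}" for name, n in SPLIT_TURNS.items()})
# -> {'train': '82.6%', 'valid': '8.8%', 'test': '8.7%'}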
Hardware Specification | No | The paper does not describe the hardware used to run its experiments.
Software Dependencies | No | The paper mentions BPE encoding (Sennrich et al., 2016) and refers to Transformer architectures (Vaswani et al., 2017) and Memory Network architectures (Sukhbaatar et al., 2015) as foundational components, but does not specify versions for these or any other key software dependencies.
Experiment Setup | Yes | We employ a beam search of 5 to select our best response. All generative models employ BPE encoding (Sennrich et al., 2016), which we found effective at enabling generators to copy rare words from Wikipedia sentences (Fan et al., 2018). ... We train the model to minimize the negative log-likelihood of the response utterance. We can add additional supervision by forcing the knowledge selection to correctly choose the same knowledge candidate as the human wizard in the training set by adding an additional cross-entropy loss over the knowledge attention, modulated by a weight λ: L = (1 - λ) L_NLL + λ L_knowledge. ... We can also improve performance of the decoder by employing knowledge dropout (K.D.), wherein we artificially prevent the model from attending to knowledge a fraction of the time during training.
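
To make the quoted objective concrete, here is a minimal PyTorch-style sketch of the combined loss L = (1 - λ) L_NLL + λ L_knowledge together with knowledge dropout. The function names, tensor shapes, and the dropout rate are illustrative assumptions, not the authors' released code:

# Hedged sketch of the combined training objective and knowledge dropout.
import torch
import torch.nn.functional as F

def combined_loss(response_logits, response_targets,
                  knowledge_attn_logits, gold_knowledge_idx, lam=0.5):
    # L_NLL: token-level negative log-likelihood of the response utterance.
    nll = F.cross_entropy(
        response_logits.view(-1, response_logits.size(-1)),
        response_targets.view(-1),
    )
    # L_knowledge: cross-entropy pushing the knowledge attention toward the
    # sentence the human wizard actually selected during data collection.
    knowledge_ce = F.cross_entropy(knowledge_attn_logits, gold_knowledge_idx)
    # L = (1 - lambda) * L_NLL + lambda * L_knowledge
    return (1 - lam) * nll + lam * knowledge_ce

def knowledge_dropout(knowledge_mask, p=0.3):
    # K.D.: with probability p, zero out the knowledge mask so the decoder
    # cannot attend to any knowledge for this training example.
    if torch.rand(1).item() < p:
        return torch.zeros_like(knowledge_mask)
    return knowledge_mask

At λ = 0 the objective reduces to pure response NLL; larger λ trades response likelihood for supervised knowledge selection.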