Wizard of Wikipedia: Knowledge-Powered Conversational Agents
Authors: Emily Dinan, Stephen Roller, Kurt Shuster, Angela Fan, Michael Auli, Jason Weston
ICLR 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To that end we collect and release a large dataset with conversations directly grounded with knowledge retrieved from Wikipedia. We then design architectures capable of retrieving knowledge, reading and conditioning on it, and finally generating natural responses. Our best performing dialogue models are able to conduct knowledgeable discussions on open-domain topics as evaluated by automatic metrics and human evaluations, while our new benchmark allows for measuring further improvements in this important research direction. |
| Researcher Affiliation | Industry | Emily Dinan, Stephen Roller, Kurt Shuster, Angela Fan, Michael Auli, Jason Weston. Facebook AI Research. {edinan,roller,kshuster,angelafan,michaelauli,jase}@fb.com |
| Pseudocode | No | The paper does not contain any pseudocode or clearly labeled algorithm blocks. |
| Open Source Code | Yes | Our new benchmark, publicly available in ParlAI (http://parl.ai/projects/wizard_of_wikipedia/), aims to encourage and measure further improvements in this important research direction. (A sketch of loading the data through ParlAI is given below the table.) |
| Open Datasets | Yes | The final dialogue dataset we collect consists of 22,311 dialogues with 201,999 turns, which we divide into 166,787 for train, 17,715 for validation, and 17,497 for test. The test set is split into two subsets, Test Seen and Test Unseen. Test Seen contains 533 overlapping topics with the training set with new dialogues about those topics. Test Unseen consists of 58 topics never seen before in train or validation. Overall data statistics can be found in Table 1, and further statistics and examples of collected conversations in Appendix A.2. |
| Dataset Splits | Yes | The final dialogue dataset we collect consists of 22,311 dialogues with 201,999 turns, which we divide into 166,787 for train, 17,715 for validation, and 17,497 for test. |
| Hardware Specification | No | The paper does not explicitly describe the specific hardware used to run its experiments. |
| Software Dependencies | No | The paper mentions BPE encoding (Sennrich et al., 2016) and refers to Transformer architectures (Vaswani et al., 2017) and Memory Network architectures (Sukhbaatar et al., 2015) as foundational components, but it does not name implementation software or version numbers for these or any other dependencies. |
| Experiment Setup | Yes | We employ a beam search of 5 to select our best response. All generative models employ BPE encoding (Sennrich et al., 2016), which we found effective at enabling generators to copy rare words from Wikipedia sentences (Fan et al., 2018). ... We train the model to minimize the negative log-likelihood of the response utterance. We can add additional supervision by forcing the knowledge selection to correctly choose the same knowledge candidate as the human wizard in the training set by adding an additional cross-entropy loss over the knowledge attention, modulated by a weight λ: L = (1 − λ)·L_NLL + λ·L_knowledge. ... We can also improve performance of the decoder by employing knowledge dropout (K.D.), wherein we artificially prevent the model from attending to knowledge a fraction of the time during training. (A sketch of this combined objective appears below the table.) |
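
The dataset and splits reported above are distributed through ParlAI. Below is a minimal sketch of browsing the data with ParlAI's `display_data` script. The task name `wizard_of_wikipedia` follows the project page linked in the table, but flag spellings can vary across ParlAI releases, so treat this as an assumed invocation rather than the authors' exact one.

```python
# Minimal sketch: browsing the Wizard of Wikipedia data via ParlAI.
# Assumes `pip install parlai`; the task name comes from the project page
# (http://parl.ai/projects/wizard_of_wikipedia/).
from parlai.scripts.display_data import DisplayData

# `datatype` selects the split: 'train' (166,787 turns), 'valid' (17,715),
# or 'test' (17,497). Note the unseen-topic test set is exposed as a task
# variant in ParlAI; see the project page for the exact teacher name.
DisplayData.main(task='wizard_of_wikipedia', datatype='valid', num_examples=3)
```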
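The two training devices quoted under Experiment Setup are easy to express in code. Below is a minimal PyTorch sketch of the combined objective L = (1 − λ)·L_NLL + λ·L_knowledge and of knowledge dropout; the function names, tensor shapes, and the 0.3 dropout rate are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def combined_loss(token_logits, target_tokens, know_logits, gold_know_idx,
                  lam=0.5):
    """Sketch of L = (1 - lam) * L_NLL + lam * L_knowledge.

    token_logits:  (batch, seq_len, vocab)  decoder outputs
    target_tokens: (batch, seq_len)         gold response (BPE token ids)
    know_logits:   (batch, n_candidates)    knowledge-attention scores
    gold_know_idx: (batch,)                 sentence the human wizard chose
    """
    # Standard negative log-likelihood over the response utterance.
    nll = F.cross_entropy(
        token_logits.reshape(-1, token_logits.size(-1)),
        target_tokens.reshape(-1),
    )
    # Extra cross-entropy supervising the knowledge attention toward the
    # wizard's chosen sentence, modulated by the weight lam (the paper's λ).
    know_ce = F.cross_entropy(know_logits, gold_know_idx)
    return (1 - lam) * nll + lam * know_ce

def knowledge_dropout(knowledge_emb, rate=0.3, training=True):
    """Knowledge dropout (K.D.): with probability `rate`, hide the knowledge
    for a training example so the decoder cannot attend to it and must rely
    on dialogue context alone. The 0.3 rate is an assumed value."""
    if not training:
        return knowledge_emb
    keep = (torch.rand(knowledge_emb.size(0), 1, 1,
                       device=knowledge_emb.device) >= rate).float()
    return knowledge_emb * keep
```

At inference time the paper decodes with a beam of size 5; that is a decoding-time setting and does not interact with the training loss sketched above.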