Dialog Inpainting: Turning Documents into Dialogs
Authors: Zhuyun Dai, Arun Tejasvi Chaganty, Vincent Y Zhao, Aida Amini, Qazi Mamunur Rashid, Mike Green, Kelvin Guu
ICML 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | By applying this approach to passages from Wikipedia and the web, we produce WikiDialog and WebDialog, two datasets totalling 19 million diverse information-seeking dialogs, 1,000x larger than the largest existing ConvQA dataset. Furthermore, human raters judge the answer adequacy and conversationality of WikiDialog to be as good or better than existing manually-collected datasets. Remarkably, our approach shows strong zero-shot capability, generating high quality synthetic data without using any in-domain ConvQA data. Using our inpainted data to pre-train ConvQA retrieval systems, we significantly advance state-of-the-art across three benchmarks (QReCC, OR-QuAC, TREC CAsT) yielding up to 40% relative gains on standard evaluation metrics. |
| Researcher Affiliation | Industry | Google Inc., Mountain View, USA. |
| Pseudocode | No | The paper describes methods textually and mathematically but does not include pseudocode or clearly labeled algorithm blocks. (An illustrative sketch of the inpainting task appears after this table.) |
| Open Source Code | No | The paper states: 'We released WikiDialog at https://github.com/google-research/dialog-inpainting', but this link points to the generated dataset, not the source code for the dialog inpainting methodology itself. |
| Open Datasets | Yes | We use four open-domain conversational QA retrieval benchmarks: OR-QuAC (Qu et al., 2020), TREC CAsT-19 (Byrne et al., 2019), TREC CAsT-20 (Dalton et al., 2020), and QReCC (Anantha et al., 2021). |
| Dataset Splits | Yes | During fine-tuning, we separately train retrievers and rerankers on OR-QuAC and QReCC, using their validation sets to select checkpoints. Because CAsT-19 and CAsT-20 are extremely small datasets and do not include a training split, we do not fine-tune dual-encoder retrievers on these datasets... We follow Yu et al. (2021) and use 5-fold cross-validation to finetune rerankers on CAsT-19 and CAsT-20: for each fold, we split the data into 5 splits based on dialogs, train a reranker on 3 splits of the data, select a checkpoint on one split and test on the remaining split. (A sketch of this fold protocol appears after the table.) |
| Hardware Specification | Yes | Unless otherwise specified, all our dialog inpainters are initialized from T5-XXL (11B parameters) and finetuned using 64 TPU v3 chips with constant learning rate 0.01, dropout rate 0.1 and batch size 128. |
| Software Dependencies | No | The paper mentions using T5 and implementing in JAX but does not specify version numbers for these or any other software libraries required for replication. |
| Experiment Setup | Yes | Unless otherwise specified, all our dialog inpainters are initialized from T5-XXL (11B parameters) and finetuned using 64 TPU v3 chips with constant learning rate 0.01, dropout rate 0.1 and batch size 128. For pre-training on our inpainted datasets, we used a softmax temperature τ of 0.01, batch size 2048, and dropout rate 0.1. The models were trained with the Adafactor optimizer with learning rate 1e-3 and 1k warmup steps. (These hyperparameters are collected in a config sketch after the table.) |
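As a companion to the Pseudocode row: a minimal sketch of the dialog inpainting idea the paper describes, where a passage's sentences are kept verbatim as one speaker's turns and a T5 model fills in the imagined reader's turns one masked turn at a time. The serialization, mask token, and `generate` callable below are illustrative assumptions, not the authors' exact format.

```python
MASK = "<extra_id_0>"  # T5 sentinel token, standing in here for the masked turn

def inpaint_dialog(passage_sentences, generate):
    """Autoregressively inpaint a reader turn before each writer sentence.

    `generate` is any callable mapping an input string to the model's
    prediction for the masked turn (e.g. a finetuned T5 decode call).
    """
    dialog = []
    for sentence in passage_sentences:
        prompt = " ".join(dialog + [MASK, sentence])
        question = generate(prompt)          # model predicts the masked turn
        dialog.extend([question, sentence])  # commit, then inpaint the next turn
    return dialog

# Usage with a stub model in place of the T5-XXL inpainter:
fake_generate = lambda prompt: "<inpainted question>"
print(inpaint_dialog(
    ["The Eiffel Tower is a wrought-iron lattice tower in Paris."],
    fake_generate,
))
```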
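The 5-fold protocol quoted in the Dataset Splits row can be made concrete with a short sketch. Splitting is done by dialog so that no conversation leaks across folds; the function names and seed are our own placeholders, not the authors' code.

```python
import random

def five_fold_splits(dialog_ids, seed=0):
    """Partition dialog IDs into 5 disjoint groups, splitting by dialog."""
    ids = list(dialog_ids)
    random.Random(seed).shuffle(ids)
    return [ids[i::5] for i in range(5)]

def cross_validation_folds(dialog_ids):
    """Yield (train, dev, test) groups: 3 splits train, 1 dev, 1 test per fold."""
    splits = five_fold_splits(dialog_ids)
    for fold in range(5):
        test = splits[fold]
        dev = splits[(fold + 1) % 5]  # checkpoint-selection split
        train = [d for i, s in enumerate(splits)
                 if i not in (fold, (fold + 1) % 5) for d in s]
        yield train, dev, test

# Each fold would train a reranker on `train`, pick a checkpoint on `dev`,
# and report metrics on `test` (training loop omitted here).
for train, dev, test in cross_validation_folds(range(50)):
    assert not set(test) & (set(train) | set(dev))  # folds stay disjoint
```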
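Finally, the training hyperparameters quoted in the Experiment Setup row, gathered in one place. The paper says only that the implementation uses JAX; the use of optax and the exact warmup-then-constant schedule below are our assumptions, and only the numeric values come from the paper.

```python
import optax

WARMUP_STEPS = 1_000  # "1k warmup steps"
PEAK_LR = 1e-3        # "learning rate 1e-3"

# Linear warmup to 1e-3 over 1k steps, constant thereafter (our reading).
schedule = optax.join_schedules(
    [optax.linear_schedule(0.0, PEAK_LR, WARMUP_STEPS),
     optax.constant_schedule(PEAK_LR)],
    boundaries=[WARMUP_STEPS],
)
optimizer = optax.adafactor(learning_rate=schedule)

# Remaining quoted hyperparameters, grouped for reference (names are ours):
INPAINTER_FINETUNE = dict(init="T5-XXL (11B)", learning_rate=0.01,
                          dropout_rate=0.1, batch_size=128)
RETRIEVER_PRETRAIN = dict(softmax_temperature=0.01, batch_size=2048,
                          dropout_rate=0.1)
```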