Distilling Knowledge from Reader to Retriever for Question Answering

Authors: Gautier Izacard, Edouard Grave

ICLR 2021

Reproducibility Variable Result LLM Response
Research Type Experimental We evaluate our method on question answering, obtaining state-of-the-art results. Our method is inspired by knowledge distillation (Hinton et al., 2015), and uses the reader model to obtain synthetic labels to train the retriever model. In this section we evaluate the student-teacher training procedure from the previous section. We show that we obtain competitive performance without strong supervision for support documents. We perform experiments on TriviaQA (Joshi et al., 2017) and Natural Questions (Kwiatkowski et al., 2019), two standard benchmarks for open-domain question answering. (A sketch of the distillation objective follows the table.)
Researcher Affiliation Collaboration Gautier Izacard¹,²,³, Edouard Grave¹; ¹Facebook AI Research, ²École normale supérieure, PSL University, ³Inria. {gizacard|egrave}@fb.com
Pseudocode No The paper describes the iterative training procedure in four numbered steps, but it does not present them as structured pseudocode or an algorithm block. (A schematic rendering of the loop follows the table.)
Open Source Code Yes Our code is available at: github.com/facebookresearch/FiD.
Open Datasets Yes We perform experiments on TriviaQA (Joshi et al., 2017) and Natural Questions (Kwiatkowski et al., 2019), two standard benchmarks for open-domain question answering. We also evaluate on NarrativeQuestions (Kočiský et al., 2018), using a publicly available preprocessed version.
Dataset Splits Yes Following the setting from Lee et al. (2019); Karpukhin et al. (2020), we use the original evaluation set as test set, and keep 10% of the training data for validation. (A split sketch follows the table.)
Hardware Specification No The paper does not provide specific details about the hardware used for running experiments, such as GPU models, CPU specifications, or memory.
Software Dependencies No The paper mentions several software components and models (BERT base model, T5 base model, Apache Lucene, SpaCy) but does not provide specific version numbers for these software dependencies, which is required for reproducibility.
Experiment Setup Yes The reader is trained for 10k gradient steps with a constant learning rate of 10^-4, and the best model is selected based on the validation performance. The retriever is trained with a constant learning rate of 5 × 10^-5 until the performance saturates. More details on the hyperparameters and the training procedure are reported in Appendix A.2, whose Table 6 (hyperparameters for retriever and reader training) specifies the number of parameters, number of heads, number of layers, hidden size, batch size, dropout, learning rate schedule, peak learning rate, and gradient clipping. (A configuration sketch follows the table.)
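
The method row above quotes the paper's knowledge-distillation framing: the reader's aggregated cross-attention scores serve as soft targets for the retriever. Below is a minimal PyTorch sketch of that KL-divergence objective; the tensor names, and the assumption that attention scores have already been aggregated per passage, are ours rather than the authors' code.

```python
import torch
import torch.nn.functional as F

def distillation_loss(retriever_scores, reader_attention_scores):
    """KL divergence between the retriever's passage distribution and the
    distribution induced by the reader's aggregated cross-attention.

    retriever_scores: (batch, n_passages) similarity scores between the
        question embedding and each passage embedding (the student).
    reader_attention_scores: (batch, n_passages) per-passage relevance
        scores aggregated from the reader's cross-attention (the teacher;
        detached so gradients flow only into the retriever).
    """
    log_student = F.log_softmax(retriever_scores, dim=-1)
    teacher = F.softmax(reader_attention_scores.detach(), dim=-1)
    # KL(teacher || student), averaged over the batch.
    return F.kl_div(log_student, teacher, reduction="batchmean")
```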
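The pseudocode row notes that the four-step iterative procedure is described only in prose. The loop below is a schematic rendering of those steps under our reading of the paper; every helper function is a hypothetical placeholder, not part of the FiD repository's API.

```python
# Hypothetical sketch of the iterative reader/retriever training loop.
# The helper names stand in for the four prose steps; they do not exist
# under these names in the authors' code.
def iterative_training(questions, corpus, n_iterations):
    retrieved = initial_retrieve(questions, corpus)  # e.g. BM25 or DPR
    for _ in range(n_iterations):
        reader = train_reader(questions, retrieved)                 # step 1
        targets = aggregate_cross_attention(reader, retrieved)      # step 2
        retriever = train_retriever(questions, retrieved, targets)  # step 3
        retrieved = retrieve(retriever, questions, corpus)          # step 4
    return reader, retriever
```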
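For the dataset-splits row, the quoted protocol (original evaluation set as test set, 10% of the training data held out for validation) can be rendered in a few lines; the seed and shuffling strategy below are assumptions, since the excerpt does not specify them.

```python
import random

def make_splits(train_examples, valid_fraction=0.1, seed=0):
    """Hold out a fraction of the training data for validation; the
    original evaluation set is used directly as the test set."""
    examples = list(train_examples)
    random.Random(seed).shuffle(examples)  # seed choice is an assumption
    n_valid = int(valid_fraction * len(examples))
    return examples[n_valid:], examples[:n_valid]  # train, validation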
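Finally, the experiment-setup row condenses into a configuration sketch. Only values stated in the excerpt are filled in; everything else (batch size, dropout, gradient clipping, and so on) lives in the paper's Table 6 in Appendix A.2 and is deliberately left out rather than guessed. The dictionary keys are illustrative.

```python
# Values taken verbatim from the quoted excerpt; keys are illustrative.
READER_TRAINING = {
    "gradient_steps": 10_000,
    "learning_rate": 1e-4,        # constant schedule
    "model_selection": "best validation performance",
}
RETRIEVER_TRAINING = {
    "learning_rate": 5e-5,        # constant schedule
    "stopping_criterion": "performance saturation",
}
```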