Knowledge-Augmented Reasoning Distillation for Small Language Models in Knowledge-Intensive Tasks

Authors: Minki Kang, Seanie Lee, Jinheon Baek, Kenji Kawaguchi, Sung Ju Hwang

NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We empirically show that KARD significantly improves the performance of small T5 and GPT models on the challenging knowledge-intensive reasoning datasets, namely MedQA-USMLE, StrategyQA, and OpenbookQA. Notably, our method makes the 250M T5 models achieve superior performance against the fine-tuned 3B models, which have 12 times more parameters, on both the MedQA-USMLE and StrategyQA benchmarks.
Researcher Affiliation | Collaboration | Minki Kang (KRAFTON, KAIST), Seanie Lee (KAIST), Jinheon Baek (KAIST), Kenji Kawaguchi (National University of Singapore), Sung Ju Hwang (KAIST, DeepAuto.ai)
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. The methods are described in narrative text with equations and figures.
Open Source Code | Yes | Work done at AITRICS. Code is available at https://github.com/Nardien/KARD.
Open Datasets | Yes | As our primary benchmark, we use the medical multiple-choice question dataset MedQA-USMLE [23]. To further validate our approach, we employ the StrategyQA [14] dataset, which involves 2,780 yes/no questions that demand sophisticated multi-step reasoning skills and the ability to gather supporting evidence from various domains. We additionally validate our approach on commonsense reasoning with the OpenbookQA [39] dataset, which consists of 5,957 elementary-level science questions with 4 multiple-choice options.
Dataset Splits | Yes | For the train-test split of each dataset, we use the official splits for MedQA-USMLE [23] and OpenbookQA [39]. For StrategyQA, we split the training set in a 7:3 ratio to build the in-house test set, following Ho et al. [17]. (A data loading and split sketch follows the table.)
Hardware Specification | Yes | Each model utilizes a maximum of 96 GB of GPU memory with 4 NVIDIA TITAN RTX GPUs for fine-tuning.
Software Dependencies | No | The paper mentions software such as the Pyserini library and the AdamW optimizer but does not provide version numbers for these or other key software dependencies, which are required for reproducibility. (A retrieval sketch follows the table.)
Experiment Setup | Yes | For all our experiments, we fine-tune the small language model for 3 epochs with a batch size of 32 using the AdamW optimizer [36] and a learning rate of 10^-4. Each model utilizes a maximum of 96 GB of GPU memory with 4 NVIDIA TITAN RTX GPUs for fine-tuning. In the StrategyQA and OpenbookQA experiments, we use the T5 model instead of Flan-T5 to prevent any potential data contamination with the corresponding test sets, as Flan-T5 is fine-tuned on both datasets during instruction tuning. For the number of documents used for knowledge augmentation during KARD training, we set k = 1 for MedQA-USMLE and StrategyQA and k = 3 for OpenbookQA; specifically, we append the documents retrieved by the retriever ρ to each training sample to construct the input for training. (A fine-tuning sketch follows the table.)
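
For the Open Datasets and Dataset Splits rows, the following is a minimal sketch of how the three benchmarks could be loaded and how the 7:3 StrategyQA split could be reproduced with the Hugging Face `datasets` library. The hub identifiers and the random seed are assumptions (community mirrors vary); the authors' repository remains the authoritative source for data preparation.

```python
from datasets import load_dataset

# NOTE: hub identifiers are assumptions (community mirrors vary); consult the
# KARD repository (https://github.com/Nardien/KARD) for the authors' pipeline.
medqa = load_dataset("GBaker/MedQA-USMLE-4-options")   # medical multiple-choice QA
obqa = load_dataset("openbookqa", "main")              # 5,957 elementary science questions
strategyqa = load_dataset("ChilleD/StrategyQA")        # 2,780 yes/no multi-hop questions

# MedQA-USMLE and OpenbookQA use their official splits; StrategyQA's training
# set is split 7:3 into train / in-house test (following Ho et al. [17]).
# The seed here is illustrative, not the authors' exact choice.
split = strategyqa["train"].train_test_split(test_size=0.3, seed=42)
train_set, in_house_test = split["train"], split["test"]
print(len(train_set), len(in_house_test))
```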
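The Software Dependencies row notes that the paper relies on the Pyserini library without pinning a version. As a hedged illustration, the snippet below retrieves passages with Pyserini's BM25 searcher over a prebuilt Wikipedia index; the index name and query are assumptions and may differ from the external knowledge base actually used in the paper, and in practice one would also pin the pyserini version in a requirements file.

```python
from pyserini.search.lucene import LuceneSearcher

# BM25 retrieval sketch; "wikipedia-dpr" is an assumed prebuilt index and may
# differ from the corpus indexed in the paper's experiments.
searcher = LuceneSearcher.from_prebuilt_index("wikipedia-dpr")
hits = searcher.search("Which vitamin deficiency causes scurvy?", k=3)
for hit in hits:
    doc = searcher.doc(hit.docid)                # fetch the stored passage
    print(hit.docid, round(hit.score, 2), doc.raw()[:120])
```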
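The Experiment Setup row specifies 3 epochs, a batch size of 32, the AdamW optimizer with a learning rate of 10^-4, and knowledge augmentation in which the top-k retrieved documents are appended to each training input. The sketch below shows a single illustrative training step under those hyperparameters with a T5-Base checkpoint (the ~250M class of model referenced above); the prompt template, model identifier, and single-example batch are assumptions, not the authors' exact pipeline.

```python
from torch.optim import AdamW
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("t5-base")   # ~250M-parameter class of T5
model = AutoModelForSeq2SeqLM.from_pretrained("t5-base")
optimizer = AdamW(model.parameters(), lr=1e-4)         # lr = 10^-4 as reported

def build_input(question: str, passages: list[str], k: int = 1) -> str:
    """Append the top-k retrieved documents to the question.

    The paper sets k = 1 for MedQA-USMLE and StrategyQA and k = 3 for
    OpenbookQA; this template string is an assumption for illustration.
    """
    return f"question: {question} knowledge: {' '.join(passages[:k])}"

# One illustrative step (batch of 1); the reported setup uses batch size 32
# and loops over the training set for 3 epochs.
inputs = tokenizer([build_input("Example question?", ["Retrieved passage."])],
                   return_tensors="pt", truncation=True)
labels = tokenizer(["Example rationale, then the answer."],
                   return_tensors="pt").input_ids
loss = model(**inputs, labels=labels).loss
loss.backward()
optimizer.step()
optimizer.zero_grad()
```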