Knowledge-Augmented Reasoning Distillation for Small Language Models in Knowledge-Intensive Tasks
Authors: Minki Kang, Seanie Lee, Jinheon Baek, Kenji Kawaguchi, Sung Ju Hwang
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically show that KARD significantly improves the performance of small T5 and GPT models on the challenging knowledge-intensive reasoning datasets, namely MedQA-USMLE, StrategyQA, and OpenbookQA. Notably, our method makes the 250M T5 models achieve superior performance against the fine-tuned 3B models, which have 12 times more parameters, on both the MedQA-USMLE and StrategyQA benchmarks. |
| Researcher Affiliation | Collaboration | Minki Kang (KRAFTON, KAIST), Seanie Lee (KAIST), Jinheon Baek (KAIST), Kenji Kawaguchi (National University of Singapore), Sung Ju Hwang (KAIST, DeepAuto.ai) |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. The methods are described in narrative text with equations and figures. |
| Open Source Code | Yes | Code is available at https://github.com/Nardien/KARD. |
| Open Datasets | Yes | As our primary benchmark, we use the medical multiple-choice question dataset MedQA-USMLE [23]. To further validate our approach, we employ the StrategyQA [14] dataset, which involves 2,780 yes/no questions that demand sophisticated multi-step reasoning skills and the ability to gather supporting evidence from various domains. We additionally validate our approach on commonsense reasoning with the OpenbookQA [39] dataset, which consists of 5,957 elementary-level science questions with 4 multiple-choice options. |
| Dataset Splits | Yes | For the train-test split of each dataset, we use the official split for MedQA-USMLE [23] and OpenbookQA [39]. For StrategyQA, we split the training set into a 7:3 ratio to build the in-house test set, following Ho et al. [17]. |
| Hardware Specification | Yes | Each model utilizes a maximum of 96GB GPU memory with 4 NVIDIA TITAN RTX GPUs for fine-tuning. |
| Software Dependencies | No | The paper mentions software such as the 'pyserini' library and the 'AdamW' optimizer but does not provide version numbers for these or other key software dependencies, which is required for reproducibility. |
| Experiment Setup | Yes | For all our experiments, we fine-tune the small language model for 3 epochs with a batch size of 32 using the AdamW optimizer [36] and a learning rate of 10^-4. Each model utilizes a maximum of 96GB GPU memory with 4 NVIDIA TITAN RTX GPUs for fine-tuning. In the StrategyQA and OpenbookQA experiments, we use the T5 model instead of Flan-T5 to prevent any potential data contamination with the corresponding test set, as Flan-T5 is fine-tuned on both datasets during instruction tuning. For the number of documents used for knowledge augmentation during KARD training, we set k = 1 for MedQA-USMLE and StrategyQA and k = 3 for OpenbookQA; specifically, we append documents retrieved from the retriever ρ along with each training sample to construct the input for training. |
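
The Experiment Setup row above is the closest the paper comes to a runnable recipe. The sketch below restates those reported hyperparameters as a Hugging Face Transformers configuration and illustrates the knowledge-augmented input construction; the checkpoint name, per-device batch size, prompt template, and helper function are illustrative assumptions, not details taken from the paper or the KARD repository.

```python
# Minimal sketch of the reported fine-tuning setup (3 epochs, batch size 32,
# AdamW, learning rate 1e-4) and of knowledge-augmented input construction
# (top-k retrieved passages appended to each training sample).
# Checkpoint, prompt template, and per-device batch size are assumptions.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, Seq2SeqTrainingArguments

MODEL_NAME = "t5-base"  # assumed stand-in for the ~250M-parameter T5 referenced in the paper
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)

training_args = Seq2SeqTrainingArguments(
    output_dir="kard-t5-base",
    num_train_epochs=3,             # "fine-tune ... for 3 epochs"
    per_device_train_batch_size=8,  # 8 x 4 GPUs = effective batch size 32 (assumption)
    learning_rate=1e-4,             # "learning rate of 10^-4"
    optim="adamw_torch",            # AdamW optimizer
)

def build_augmented_input(question: str, retrieved_docs: list[str], k: int = 1) -> str:
    """Append the top-k passages from the retriever to a training sample.

    The paper sets k = 1 for MedQA-USMLE and StrategyQA and k = 3 for
    OpenbookQA; this concatenation template is an illustrative assumption.
    """
    context = " ".join(retrieved_docs[:k])
    return f"question: {question} context: {context}"
```

Note that this sketch only covers the configuration quoted in the table; the retriever itself and the teacher-generated rationales used as distillation targets are not reproduced here.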