Thrust: Adaptively Propels Large Language Models with External Knowledge

Authors: Xinran Zhao, Hongming Zhang, Xiaoman Pan, Wenlin Yao, Dong Yu, Jianshu Chen

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments demonstrate that Thrust is a good measurement of PTLM models' instance-level knowledgeability. Moreover, we can achieve higher cost-efficiency with the Thrust score as the retrieval indicator than the naive usage of external knowledge on 88% of the evaluated tasks, with a 26% average performance improvement. Such findings shed light on the real-world practice of knowledge-enhanced LMs under a limited knowledge-seeking budget due to computation latency or costs. To comprehensively understand the effectiveness of Thrust, we conduct experiments on diverse NLP tasks. (A retrieval-gate sketch follows the table.)
Researcher Affiliation | Collaboration | Tencent AI Lab, Bellevue; Language Technologies Institute, Carnegie Mellon University
Pseudocode | No | The paper describes the steps of its method in text and mathematical formulas but does not include any clearly labeled pseudocode or algorithm blocks. (A hedged scoring sketch follows the table.)
Open Source Code | Yes | Our code is available at https://github.com/colinzhaoust/thrust_neurips2023.
Open Datasets | Yes | Multiple-choice classification. For MC classification, each query q includes a sentence or a question and requires models to select the correct answer from a set of candidates. Specifically, (i) AGNews [62] asks the model to classify a piece of news as politics, sports, business, or technology. ... Open-domain QA. The involved datasets are Hotpot QA [58], Natural Questions (NQ) [27], Web Questions [2], Curated TREC [1], and Trivia QA [19]. We use Wikipedia paragraphs retrieved by DPR as the external knowledge, as is common practice [59], except for Hotpot QA, where we use the passages from which the queries were generated as a gold knowledge resource. (A dataset-loading sketch follows the table.)
Dataset Splits | No | The paper states 'We sample 200 data points from each dataset to conduct the clustering step of Thrust', but that sampling only sets up the metric. The paper does not specify explicit training, validation, and test splits (e.g., percentages or exact counts per split) for the main experimental evaluation. (A seeded-sampling sketch follows the table.)
Hardware Specification | Yes | We conduct our experiments on a machine with 8 Nvidia P40 (24G) GPUs with CUDA 11 installed.
Software Dependencies | No | The paper mentions the Scikit-learn and Hugging Face Transformers packages but does not give version numbers for these dependencies, which reproducibility requires. While CUDA 11 is mentioned, it is an environment component rather than a pinned software dependency for the model itself. (A version-recording snippet follows the table.)
Experiment Setup | Yes | For the hyperparameters of the inference models: on the QA tasks, we set the maximum knowledge length to 480 tokens to ensure that query sentences stay within the input. The generated answers for QA tasks are typically within 30 tokens for all models. ... We run all experiments 3 times and report the averaged performance in the main content. (A generation-setup sketch follows the table.)
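
As the Pseudocode row notes, the paper gives its method only in text and formulas. The following is a minimal sketch of one reading of that procedure: sample instances per class, cluster their hidden representations with K-means (the "clustering step" quoted above), and score a query by a gravity-like resultant force in which each cluster attracts the query in proportion to its size and the inverse squared distance. All function names here are hypothetical, and the exact weighting should be verified against the formula in the paper.

```python
import numpy as np
from sklearn.cluster import KMeans


def fit_class_clusters(embeddings_per_class, k=3, seed=0):
    """Cluster the sampled instance embeddings of each class with K-means.

    embeddings_per_class: list of (n_i, d) float arrays, one per class.
    Returns a list of (centroid, cluster_size) pairs across all classes.
    """
    clusters = []
    for X in embeddings_per_class:
        km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X)
        for j in range(k):
            size = int(np.sum(km.labels_ == j))
            clusters.append((km.cluster_centers_[j], size))
    return clusters


def thrust_score(query_emb, clusters):
    """Gravity-style score: each cluster pulls the query with force
    proportional to its size and the inverse squared distance."""
    force = np.zeros_like(query_emb, dtype=float)
    for centroid, size in clusters:
        d = centroid - query_emb          # vector from query to centroid
        dist = np.linalg.norm(d) + 1e-8   # guard against division by zero
        force += (size / dist ** 2) * (d / dist)
    force /= len(clusters)                # average over all clusters
    return float(np.linalg.norm(force))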
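
Given such a score, the "retrieval indicator" claim in the Research Type row amounts to a threshold rule: only pay the knowledge-seeking cost for queries the model appears not to know. A hedged sketch, reusing thrust_score from above and assuming a tuned threshold tau and a placeholder retrieve function (e.g., DPR over Wikipedia); the comparison direction follows the intuition that a high Thrust score signals the model already holds the needed knowledge.

```python
def build_prompt(query, query_emb, clusters, tau, retrieve):
    """Gate retrieval on the Thrust score: fetch external knowledge only
    for low-scoring (unknowledgeable-looking) queries."""
    if thrust_score(query_emb, clusters) < tau:
        passages = retrieve(query)             # knowledge-seeking path
        return "\n".join(passages) + "\n" + query
    return query                               # answer from parametric knowledge
```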
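
The evaluated datasets are all publicly distributed; below is a minimal loading sketch with the Hugging Face datasets library. The Hub IDs and configuration names are assumptions, since the authors may have used different copies or preprocessing.

```python
from datasets import load_dataset

agnews = load_dataset("ag_news", split="test")                          # MC classification
trivia = load_dataset("trivia_qa", "rc.nocontext", split="validation")  # open-domain QA
print(agnews[0]["text"], agnews[0]["label"])
```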
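
For the 200-point clustering sample quoted in the Dataset Splits row, a seeded draw keeps the step repeatable even though the paper reports only the count; the seed value here is an assumption.

```python
from datasets import load_dataset

sample = load_dataset("ag_news", split="train").shuffle(seed=0).select(range(200))
assert len(sample) == 200
```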
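
Since the paper names its key libraries but not their versions, anyone re-running the experiments should record the versions actually installed, e.g.:

```python
import sklearn
import transformers
import torch

# Pin these in a requirements file once the runs are reproduced.
print("scikit-learn:", sklearn.__version__)
print("transformers:", transformers.__version__)
print("torch:", torch.__version__)
```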
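
Finally, the reported inference settings (knowledge truncated to 480 tokens, answers within roughly 30 tokens) map onto the transformers API as below. The model choice, prompt format, and example strings are assumptions for illustration only.

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tok = AutoTokenizer.from_pretrained("t5-base")        # illustrative model choice
model = AutoModelForSeq2SeqLM.from_pretrained("t5-base")

query = "Who wrote The Old Man and the Sea?"          # hypothetical example
knowledge = "Ernest Hemingway published The Old Man and the Sea in 1952."

# Cap the knowledge at 480 tokens so the query itself always fits.
knowledge_ids = tok(knowledge, truncation=True, max_length=480).input_ids
prompt = tok.decode(knowledge_ids, skip_special_tokens=True) + "\n" + query

inputs = tok(prompt, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=30)     # answers stay within ~30 tokens
print(tok.decode(out[0], skip_special_tokens=True))
```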