Retrieval meets Long Context Large Language Models

Authors: Peng Xu, Wei Ping, Xianchao Wu, Lawrence McAfee, Chen Zhu, Zihan Liu, Sandeep Subramanian, Evelina Bakhturina, Mohammad Shoeybi, Bryan Catanzaro

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "In this work, we answer these questions by studying both solutions using two state-of-the-art pretrained LLMs, i.e., a proprietary 43B GPT and Llama2-70B. Perhaps surprisingly, we find that LLM with 4K context window using simple retrieval-augmentation at generation can achieve comparable performance to finetuned LLM with 16K context window via positional interpolation on long context tasks, while taking much less computation. More importantly, we demonstrate that retrieval can significantly improve the performance of LLMs regardless of their extended context window sizes." (See Sketch 1 after the table for a minimal illustration of retrieval-augmentation at generation.)
Researcher Affiliation | Industry | Peng Xu, Wei Ping, Xianchao Wu, Lawrence McAfee, Chen Zhu, Zihan Liu, Sandeep Subramanian, Evelina Bakhturina, Mohammad Shoeybi, Bryan Catanzaro. NVIDIA. {pengx, wping}@nvidia.com
Pseudocode | No | The paper does not contain any pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide a link or an explicit statement about releasing open-source code for its methodology.
Open Datasets | Yes | "Specifically, we include four datasets from the validation set of the SCROLLS benchmark (Shaham et al., 2022). QMSum (QM) (Zhong et al., 2021) is a query-based summarization dataset... Qasper (QASP) (Dasigi et al., 2021) is a question answering dataset... NarrativeQA (NQA) (Kočiský et al., 2018) is an established question answering dataset... QuALITY (QLTY) (Pang et al., 2022) is a question answering dataset... We take another three datasets from LongBench (Bai et al., 2023). HotpotQA (HQA) (Yang et al., 2018) is a Wikipedia-based question-answer dataset... MuSiQue (MSQ) (Trivedi et al., 2022) is another multi-hop question answering dataset... MultiFieldQA-en (MFQA) (Bai et al., 2023) was manually curated..." (See Sketch 2 after the table for a hedged loading example.)
Dataset Splits | Yes | "Specifically, we include four datasets from the validation set of the SCROLLS benchmark (Shaham et al., 2022)."
Hardware Specification | No | The paper mentions GPUs generally but does not specify any particular GPU model (e.g., NVIDIA A100, Tesla V100) or other hardware details used for its experiments.
Software Dependencies | No | The paper refers to specific models and techniques (e.g., RoPE embeddings, FlashAttention, Dragon, Contriever, OpenAI embeddings) but does not list version numbers for any software libraries or dependencies used in its experiments.
Experiment Setup | Yes | "We extend the 4K context window to 16K for GPT-43B. For Llama2, we extend its 4K context window to 32K for Llama2-7B and both 16K and 32K for Llama2-70B. We follow Chen et al. (2023) and finetune both LLMs on the Pile dataset (Gao et al., 2021) with batch size as 128, constant learning rate of 5e-6 to adapt the position embeddings. We finetune the LLM by taking the loss only on the {Answer} part with batch size 128 and learning rate of 5e-6 for 1000 steps." (See Sketch 3 after the table for what the Chen et al. (2023) recipe does to the position embeddings.)
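
Sketch 1 (for the Research Type row). A minimal, hedged illustration of "retrieval-augmentation at generation" as the abstract describes it: the long document is chunked, a retriever scores each chunk against the query, and only the top-k chunks are placed in the LLM's 4K prompt. The chunk size and k below are illustrative defaults, and `embed` and `llm_generate` are hypothetical stand-ins for a retriever (the paper mentions Dragon, Contriever, and OpenAI embeddings) and for the LLM call itself.

```python
import numpy as np

def chunk_text(document: str, chunk_size: int = 300) -> list[str]:
    """Split a long document into fixed-size word chunks."""
    words = document.split()
    return [" ".join(words[i:i + chunk_size])
            for i in range(0, len(words), chunk_size)]

def retrieve_top_k(query: str, chunks: list[str], embed, k: int = 5) -> list[str]:
    """Rank chunks by cosine similarity between query and chunk embeddings."""
    q = embed(query)
    scores = [np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v))
              for v in (embed(c) for c in chunks)]
    top = np.argsort(scores)[::-1][:k]
    return [chunks[i] for i in top]

def answer_with_retrieval(document: str, query: str, embed, llm_generate,
                          k: int = 5) -> str:
    """Build a short retrieval-augmented prompt instead of feeding the whole document."""
    context = "\n\n".join(retrieve_top_k(query, chunk_text(document), embed, k))
    prompt = f"{context}\n\nQuestion: {query}\nAnswer:"
    return llm_generate(prompt)  # the prompt now fits a 4K context window
```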
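Sketch 2 (for the Open Datasets row). The paper names the seven evaluation sets but gives no loading path; the snippet below is an assumption-laden convenience, presuming the Hugging Face Hub copies `tau/scrolls` and `THUDM/LongBench` (both of which rely on dataset loading scripts, hence `trust_remote_code`) match the versions the authors evaluated.

```python
from datasets import load_dataset

scrolls_tasks = ["qmsum", "qasper", "narrative_qa", "quality"]
longbench_tasks = ["hotpotqa", "musique", "multifieldqa_en"]

# SCROLLS exposes a validation split, which the paper says it evaluates on.
scrolls = {t: load_dataset("tau/scrolls", t, split="validation",
                           trust_remote_code=True)
           for t in scrolls_tasks}

# LongBench ships its evaluation data under a "test" split.
longbench = {t: load_dataset("THUDM/LongBench", t, split="test",
                             trust_remote_code=True)
             for t in longbench_tasks}

for name, ds in {**scrolls, **longbench}.items():
    print(name, len(ds))
```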
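Sketch 3 (for the Experiment Setup row). The "Chen et al. (2023)" recipe the row cites is positional interpolation: RoPE position indices are rescaled by original-context/extended-context so that an extended window (e.g., 16K) maps back into the position range the model was pretrained on (4K), after which the model is briefly finetuned (here on the Pile, batch size 128, constant learning rate 5e-6). The function below is a sketch of that rescaling, not the authors' code; names and shapes are illustrative.

```python
import torch

def rope_angles(positions: torch.Tensor, dim: int, base: float = 10000.0,
                orig_ctx: int = 4096, extended_ctx: int = 16384) -> torch.Tensor:
    """Rotary embedding angles with interpolated positions.

    positions: integer positions in [0, extended_ctx); the result has
    shape (len(positions), dim // 2).
    """
    scale = orig_ctx / extended_ctx          # e.g. 4096 / 16384 = 0.25
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    scaled_pos = positions.float() * scale   # interpolate rather than extrapolate
    return torch.outer(scaled_pos, inv_freq)

# The interpolated angles feed the usual RoPE rotation of queries and keys;
# the short finetuning run then adapts the model to the rescaled positions.
angles = rope_angles(torch.arange(16384), dim=128)
cos, sin = angles.cos(), angles.sin()
```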