Superposition Prompting: Improving and Accelerating Retrieval-Augmented Generation
Authors: Thomas Merth, Qichen Fu, Mohammad Rastegari, Mahyar Najibi
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate the capability of our method to simultaneously enhance time efficiency across a variety of question-answering benchmarks using multiple pre-trained LLMs. Furthermore, our technique significantly improves accuracy when the retrieved context is large relative to the context the model was trained on. We perform experiments on three families of large language models, namely OpenELM (Mehta et al., 2024), BLOOMZ (Muennighoff et al., 2023), and MPT (MosaicML NLP Team, 2023). We leverage the publicly available NaturalQuestions-Open (Liu et al., 2023a) and MuSiQue (Trivedi et al., 2022) datasets. |
| Researcher Affiliation | Industry | 1Apple, Cupertino, CA, USA 2Meta, Menlo Park, CA, USA (*Work done while at Apple). Correspondence to: T. Merth <tmerth@apple.com>, Q. Fu <qfu22@apple.com>, M. Rastegari <mrastegari@meta.com>, M. Najibi <najibi@apple.com>. |
| Pseudocode | Yes | Please refer to Algorithm 3 for an algorithmic formalization. |
| Open Source Code | Yes | For reproducibility, our implementation can be found at https://github.com/apple/ml-superposition-prompting. |
| Open Datasets | Yes | We leverage the publicly available NaturalQuestions-Open (Liu et al., 2023a) and MuSiQue (Trivedi et al., 2022) datasets. |
| Dataset Splits | Yes | We validate our approach on the dev split of MuSiQue-Ans (reporting Answer EM and F1). We follow the same experimental setup as Liu et al., 2023a, including the same preprocessing and evaluation methodology for the 20-document setting (reporting Best EM Subspan, or Accuracy for short). |
| Hardware Specification | Yes | In Table 5 and Table 7, we present measurements of the compared methods in a realistic server deployment scenario (an NVIDIA A100 80GB). |
| Software Dependencies | No | We use the fvcore (facebookresearch, 2024) package to compute theoretical floating point operation (FLOP) counts for various inference settings. Our CUDA implementation is written in pure PyTorch. While these software components are mentioned, specific version numbers for them (e.g., PyTorch 1.x) are not provided. |
| Experiment Setup | Yes | We use greedy autoregressive decoding in all experiments, and randomize the order of documents to prevent any systematic bias possible due to the location of the gold documents (à la Liu et al., 2023a). We introduce the hyperparameter superposition factor as a parameter to interpolate between a fully superimposed and fully classical prompt. Here, we sweep values for top-k for our method, where k is the number of documents retained for generating the answer (full table results are provided in Table 5). |
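The top-k sweep described in the setup prunes retrieved documents to the k most salient before answer generation. The sketch below is a hypothetical illustration of that selection step, assuming each document has already been assigned a saliency score (the scores and helper `prune_paths` are stand-ins, not the paper's implementation):

```python
# Hypothetical sketch of top-k document pruning: keep only the k
# highest-scoring documents for answer generation. The saliency
# scores here are illustrative placeholders, not model-derived.

def prune_paths(doc_scores: dict, k: int) -> list:
    """Return the ids of the top-k documents by saliency score."""
    ranked = sorted(doc_scores, key=doc_scores.get, reverse=True)
    return ranked[:k]

scores = {"doc_a": 0.91, "doc_b": 0.15, "doc_c": 0.62, "doc_d": 0.08}
print(prune_paths(scores, k=2))  # ['doc_a', 'doc_c']
```

Sweeping k then trades off answer accuracy against the amount of context the model must attend to during generation.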