Accelerating Iterative Retrieval-augmented Language Model Serving with Speculation
Authors: Zhihao Zhang, Alan Zhu, Lijie Yang, Yihua Xu, Lanting Li, Phitchaya Mangpo Phothilimthana, Zhihao Jia
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Section 5.1, "Experimental Setups": "We describe our experimental setups in this section, including language models, downstream datasets, retrievers, and the implementation details of the baseline, as well as our approach." |
| Researcher Affiliation | Collaboration | ¹Carnegie Mellon University, ²University of California, Berkeley, ³Google DeepMind. |
| Pseudocode | Yes | Algorithm 1: RaLMSpec Pipeline. (A hedged sketch of this speculate-then-verify loop appears below the table.) |
| Open Source Code | Yes | Evaluation code is publicly available at https://github.com/JackFram/ralm-sys |
| Open Datasets | Yes | We thus include four QA datasets in our experiments: WikiQA, WebQuestions, Natural Questions, and TriviaQA (Yang et al., 2015; Berant et al., 2013; Kwiatkowski et al., 2019; Joshi et al., 2017). (A hedged loading snippet follows the table.) |
| Dataset Splits | No | The paper mentions using datasets for evaluation but does not specify exact train/validation/test splits with percentages, sample counts, or references to predefined splits. |
| Hardware Specification | Yes | We use the VM.GPU.A10 instance on the Oracle cloud, which contains one A10 GPU and 15 CPUs, for models that can fit into a single device. For larger models (LLaMA-2-70B) we use instances with four A100-80G GPUs and 20 CPUs. |
| Software Dependencies | No | The paper mentions software such as Python, Pyserini, and FAISS but does not pin specific version numbers for these dependencies. (An illustrative retriever setup follows the table.) |
| Experiment Setup | Yes | For both RaLMSpec and RaLMSeq, we set the maximum input prompt length to be 512 tokens and the maximum generation length to be 128 tokens. For document-level iterative RaLM serving, the maximum length of the retrieved document chunk is set to 256 as in Ram et al. (2023). When OS3 is disabled, RaLMSpec uses a constant speculation stride s = 3. Whenever OS3 is enabled, RaLMSpec initializes the speculation stride with s = 1 and lets the scheduler adapt onwards. In all our experiments, we set the window size w = 5 and γ_max = 0.6 for estimating γ. For prefetching, we use a prefetch size of 20. We also test with a prefetch size of 256 for the ablation study. (These values are collected into a config sketch below the table.) |
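To make the pseudocode row concrete: below is a minimal, self-contained Python sketch of the speculate-then-verify loop that Algorithm 1 (the RaLMSpec pipeline) describes. It is not the authors' implementation; the toy cache, the index-based queries, and the batched retriever are stand-ins chosen so the accept/rollback control flow is runnable.

```python
# A toy stand-in for the true retriever: query q "truly" maps to "doc{q}".
def true_retrieve_batch(queries):
    return [f"doc{q}" for q in queries]

def ralmspec(num_steps, stride=3):
    """Toy speculate-then-verify loop: guess `stride` retrievals from a
    local cache, then check them with one batched retriever call."""
    cache = {}   # local speculation cache: query -> last verified doc
    out = []     # verified (query, doc) pairs, in generation order
    step = 0
    while step < num_steps:
        # Queries repeat (q = step mod 3) so cache hits can occur.
        batch = [s % 3 for s in range(step, min(step + stride, num_steps))]
        guesses = [cache.get(q) for q in batch]   # speculation phase
        truths = true_retrieve_batch(batch)       # batched verification
        accepted = 0
        for guess, truth in zip(guesses, truths):
            if guess != truth:
                break                             # first mis-speculation
            accepted += 1
        for q, t in zip(batch, truths):
            cache[q] = t                          # refresh the cache
        # Keep correct speculations; redo the first wrong step with the
        # verified document, so at least one step always advances.
        advance = max(accepted, 1)
        out.extend(zip(batch[:advance], truths[:advance]))
        step += advance
    return out

print(ralmspec(num_steps=6))  # fewer loop iterations than 6 sequential retrievals
```

Running it shows the intended dynamic: once the cache warms up, several retrieval steps are verified per batched call rather than one retrieval per step.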
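For the four QA datasets, one hedged way to obtain them is through the Hugging Face `datasets` library; the dataset IDs and configs below are our assumptions, as the paper does not say how the data was fetched.

```python
from datasets import load_dataset

# Dataset IDs/configs are assumptions, not stated in the paper.
wiki_qa   = load_dataset("wiki_qa")
web_q     = load_dataset("web_questions")
nat_q     = load_dataset("natural_questions")  # very large download
trivia_qa = load_dataset("trivia_qa", "rc")
```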
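Because versions are unpinned, rebuilding the retrieval stack involves guesswork. The snippet below is one illustrative Pyserini setup over a prebuilt sparse Wikipedia index; the index name, query, and `k` are our choices, not the paper's, and pinning `pyserini`/`faiss` versions in a requirements file would still be needed for true reproducibility.

```python
from pyserini.search.lucene import LuceneSearcher

# The prebuilt index name is an assumption; Pyserini ships several
# Wikipedia indexes and the paper does not say which was used.
searcher = LuceneSearcher.from_prebuilt_index("wikipedia-dpr")
hits = searcher.search("who wrote the declaration of independence", k=20)
for hit in hits[:3]:
    print(hit.docid, round(hit.score, 2))
```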
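For reference, the reported hyperparameters can be collected into a single configuration; the dict below restates only the values quoted above, with key names of our own invention.

```python
# Our summary of the reported setup; key names are hypothetical.
RALMSPEC_SETUP = {
    "max_prompt_tokens": 512,
    "max_generation_tokens": 128,
    "max_doc_chunk_tokens": 256,  # document-level serving, as in Ram et al. (2023)
    "speculation_stride": 3,      # constant stride when OS3 is disabled
    "os3_initial_stride": 1,      # OS3 adapts the stride from here
    "window_size_w": 5,           # for estimating gamma
    "gamma_max": 0.6,
    "prefetch_size": 20,          # 256 in the ablation study
}
```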