Accelerating Iterative Retrieval-augmented Language Model Serving with Speculation

Authors: Zhihao Zhang, Alan Zhu, Lijie Yang, Yihua Xu, Lanting Li, Phitchaya Mangpo Phothilimthana, Zhihao Jia

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "5. Evaluation. 5.1. Experimental Setups. We describe our experimental setups in this section, including language models, downstream datasets, retrievers, and the implementation details of the baseline, as well as our approach."
Researcher Affiliation | Collaboration | "1 Carnegie Mellon University, 2 University of California, Berkeley, 3 Google DeepMind."
Pseudocode | Yes | "Algorithm 1 RaLMSpec Pipeline."
Open Source Code | Yes | "Evaluation code is publicly available at https://github.com/JackFram/ralm-sys"
Open Datasets | Yes | "We thus include four QA datasets in our experiments: WikiQA, WebQuestions, Natural Questions, and TriviaQA (Yang et al., 2015; Berant et al., 2013; Kwiatkowski et al., 2019; Joshi et al., 2017)."
Dataset Splits | No | The paper mentions using datasets for evaluation but does not specify exact train/validation/test splits with percentages, sample counts, or references to predefined splits.
Hardware Specification | Yes | "We use the VM.GPU.A10 instance on the Oracle cloud, which contains one A10 GPU and 15 CPUs, for models that can fit into a single device. For larger models (LLaMA-2-70B) we use instances with four A100-80G GPUs and 20 CPUs."
Software Dependencies | No | The paper mentions software such as Python, Pyserini, and FAISS but does not provide specific version numbers for these dependencies.
Experiment Setup | Yes | "For both RaLMSpec and RaLMSeq, we set the maximum input prompt length to be 512 tokens and the maximum generation length to be 128 tokens. For document-level iterative RaLM serving, the maximum length of the retrieved document chunk is set to 256 as in Ram et al. (2023). When OS3 is disabled, RaLMSpec uses a constant speculation stride s = 3. Whenever OS3 is enabled, RaLMSpec initializes the speculation stride with s = 1 and lets the scheduler adapt onwards. In all our experiments, we set the window size w = 5 and γmax = 0.6 for estimating γ. For prefetching, we use a prefetch size of 20. We also test with a prefetch size of 256 for the ablation study."
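The hyperparameters quoted above can be collected into a single configuration object, which makes a reimplementation easier to audit against the paper. The sketch below is hypothetical: the class and field names are our own, and only the numeric values (prompt/generation lengths, chunk length, stride, window size, γmax, prefetch sizes) come from the quoted setup.

```python
# Hypothetical configuration sketch for the experimental setup quoted above.
# Class/field names are illustrative; the default values are taken from the paper's text.
from dataclasses import dataclass


@dataclass
class RaLMSpecConfig:
    max_prompt_len: int = 512      # maximum input prompt length (tokens)
    max_gen_len: int = 128         # maximum generation length (tokens)
    max_doc_chunk_len: int = 256   # retrieved chunk length for document-level serving
    speculation_stride: int = 3    # constant stride s used when OS3 is disabled
    os3_enabled: bool = False      # adaptive stride scheduling (OS3)
    window_size: int = 5           # window w for estimating gamma
    gamma_max: float = 0.6         # cap gamma_max on the estimated gamma
    prefetch_size: int = 20        # 256 is also tested in the ablation study

    def initial_stride(self) -> int:
        # With OS3 enabled, the scheduler initializes s = 1 and adapts onwards;
        # otherwise the constant stride s = 3 is used throughout.
        return 1 if self.os3_enabled else self.speculation_stride
```

For example, `RaLMSpecConfig(os3_enabled=True).initial_stride()` returns 1, matching the adaptive-stride initialization described in the setup.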