Towards Optimal Caching and Model Selection for Large Model Inference

Authors: Banghua Zhu, Ying Sheng, Lianmin Zheng, Clark Barrett, Michael Jordan, Jiantao Jiao

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirically, simulations show that the combination of our caching and model multiplexing algorithms greatly improves over the baselines, with up to a 50× improvement over the baseline when the ratio between the maximum cost and minimum cost is 100. Experiments on real datasets show a 4.3× improvement in FLOPs over the baseline when the ratio of FLOPs is 10, and a 1.8× improvement in latency when the ratio of average latencies is 1.85.
Researcher Affiliation | Academia | Banghua Zhu, Department of EECS, UC Berkeley, banghua@berkeley.edu; Ying Sheng, Computer Science Department, Stanford University, Ying.Sheng@stanford.edu; Lianmin Zheng, Department of EECS, UC Berkeley, lmzheng@berkeley.edu; Clark Barrett, Computer Science Department, Stanford University, barrett@cs.stanford.edu; Michael I. Jordan, Department of EECS, UC Berkeley, jordan@berkeley.edu; Jiantao Jiao, Department of EECS, UC Berkeley, jiantao@berkeley.edu
Pseudocode | Yes | Algorithm 1: Caching in Online Learning... Algorithm 2: Joint Design of Caching and Model Multiplexing. (A hedged sketch of such a joint caching-and-multiplexing loop appears after the table.)
Open Source Code | Yes | The code is available at https://github.com/Ying1123/llm-caching-multiplexing.
Open Datasets | Yes | We evaluate our algorithms on two tasks: next-token prediction on the Lambada (Paperno et al., 2016) dataset and chat assistant on the Open Assistant (Köpf et al., 2023) dataset.
Dataset Splits | No | The paper mentions total queries and cache size (e.g., 'total queries 10000 and cache size 40') and that a model switcher was 'fine-tune[d]... with 2000 samples,' but it does not specify explicit train/validation/test dataset splits or their proportions.
Hardware Specification | No | The paper does not provide specific details about the hardware used for experiments, such as GPU or CPU models.
Software Dependencies | No | The paper mentions models such as BERT, OPT, FastChat-T5, and Vicuna, but it does not specify version numbers for any software dependencies or libraries used (e.g., Python, PyTorch, or TensorFlow versions).
Experiment Setup | Yes | We conduct both simulations and real-world experiments with our proposed methods... We consider 20 distinct prompts and set the cache size to be 10... For the next-token prediction task, we run the offline algorithm with two models: OPT-1.3B and OPT-13B... We fine-tune a BERT base model with 2000 samples as the model switcher by predicting whether the small model can give the correct result and achieve 80.2% accuracy. We work with 100 unseen distinct prompts in the offline setting with total queries 10000 and cache size 40. (A hedged fine-tuning sketch of such a switcher appears after the table.)
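
The paper's pseudocode is not reproduced on this page. As a rough illustration of how the two components named in the Pseudocode row fit together, a cost-aware cache coupled with a model-multiplexing rule, the following is a minimal sketch. The names (CachedServer, small_model, large_model, quality_predictor), the 0.5 routing threshold, and the frequency-times-cost admission/eviction score are assumptions made for illustration, not the authors' implementation.

```python
# Minimal sketch of joint caching and model multiplexing.
# All names and the scoring rule are illustrative assumptions.

class CachedServer:
    def __init__(self, cache_size, small_model, large_model, quality_predictor,
                 small_cost=1.0, large_cost=10.0):
        self.cache_size = cache_size
        self.cache = {}    # prompt -> cached response
        self.counts = {}   # prompt -> observed frequency (popularity estimate)
        self.costs = {}    # prompt -> cost paid when the prompt was last processed
        self.small_model = small_model
        self.large_model = large_model
        self.quality_predictor = quality_predictor  # prompt -> P(small model suffices)
        self.small_cost = small_cost
        self.large_cost = large_cost

    def _score(self, prompt):
        # Cost-aware score: estimated frequency times processing cost.
        return self.counts.get(prompt, 1) * self.costs.get(prompt, self.small_cost)

    def query(self, prompt):
        self.counts[prompt] = self.counts.get(prompt, 0) + 1
        if prompt in self.cache:              # cache hit: no inference cost
            return self.cache[prompt]

        # Model multiplexing: route to the small model when it is predicted
        # to be good enough, otherwise pay for the large model.
        if self.quality_predictor(prompt) >= 0.5:
            response, cost = self.small_model(prompt), self.small_cost
        else:
            response, cost = self.large_model(prompt), self.large_cost
        self.costs[prompt] = cost

        # Caching rule: admit the prompt, evicting the lowest-scoring entry
        # if the new prompt's score beats it.
        if len(self.cache) < self.cache_size:
            self.cache[prompt] = response
        else:
            victim = min(self.cache, key=self._score)
            if self._score(prompt) > self._score(victim):
                del self.cache[victim]
                self.cache[prompt] = response
        return response
```

Here a simple count-times-cost proxy stands in for the paper's cost-aware eviction score; the abstract's point is that accounting for processing cost, rather than frequency alone, is what drives the reported gains.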
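
The Experiment Setup row mentions fine-tuning a BERT base model on 2000 samples to predict whether the small model (OPT-1.3B) gives the correct result. The paper does not include the training script; the sketch below shows one way such a binary switcher could be fine-tuned with the Hugging Face transformers and datasets libraries. The placeholder data, label convention, and hyperparameters are assumptions for illustration.

```python
# Hedged sketch: fine-tune bert-base-uncased as a binary "model switcher"
# predicting whether the small model will answer a prompt correctly.
# Assumes (prompt, label) pairs were already produced by scoring the small
# model's outputs (e.g., next-token correctness on Lambada prompts).
from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          TrainingArguments, Trainer)

# Placeholder training data; in the paper's setup this would be ~2000 labeled prompts.
data = Dataset.from_dict({
    "text": ["prompt 1 ...", "prompt 2 ..."],
    "label": [1, 0],   # 1 = small model suffices, 0 = route to the large model
})

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True,
                     padding="max_length", max_length=128)

tokenized = data.map(tokenize, batched=True)

args = TrainingArguments(output_dir="switcher", num_train_epochs=3,
                         per_device_train_batch_size=16, learning_rate=2e-5)
trainer = Trainer(model=model, args=args, train_dataset=tokenized)
trainer.train()
```

At serving time, the switcher's predicted class can play the role of the quality_predictor in the caching sketch above, closing the loop between routing and caching.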