Towards Optimal Caching and Model Selection for Large Model Inference
Authors: Banghua Zhu, Ying Sheng, Lianmin Zheng, Clark Barrett, Michael I. Jordan, Jiantao Jiao
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, simulations show that the combination of our caching and model multiplexing algorithms greatly improves over the baselines, with up to 50× improvement over the baseline when the ratio between the maximum cost and minimum cost is 100. Experiments on real datasets show a 4.3× improvement in FLOPs over the baseline when the ratio for FLOPs is 10, and a 1.8× improvement in latency when the ratio for average latency is 1.85. |
| Researcher Affiliation | Academia | Banghua Zhu, Department of EECS, UC Berkeley, banghua@berkeley.edu; Ying Sheng, Computer Science Department, Stanford University, Ying.Sheng@stanford.edu; Lianmin Zheng, Department of EECS, UC Berkeley, lmzheng@berkeley.edu; Clark Barrett, Computer Science Department, Stanford University, barrett@cs.stanford.edu; Michael I. Jordan, Department of EECS, UC Berkeley, jordan@berkeley.edu; Jiantao Jiao, Department of EECS, UC Berkeley, jiantao@berkeley.edu |
| Pseudocode | Yes (see the serving-loop sketch after this table) | Algorithm 1 Caching in Online Learning... Algorithm 2 Joint Design of Caching and Model Multiplexing |
| Open Source Code | Yes | The code is available at https://github.com/Ying1123/llm-caching-multiplexing. |
| Open Datasets | Yes | We evaluate our algorithms on two tasks: next-token prediction on the Lambada (Paperno et al., 2016) dataset and chat assistant on the Open Assistant (Köpf et al., 2023) dataset. |
| Dataset Splits | No | The paper mentions total queries and cache size (e.g., 'total queries 10000 and cache size 40') and that a model switcher was 'fine-tune[d]... with 2000 samples,' but it does not specify explicit train/validation/test dataset splits or their proportions. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for experiments, such as GPU or CPU models. |
| Software Dependencies | No | The paper mentions models like BERT, OPT, FastChat-T5, and Vicuna, but it does not specify the version numbers of any software dependencies or libraries used (e.g., Python, PyTorch, TensorFlow versions). |
| Experiment Setup | Yes (see the switcher sketch after this table) | We conduct both simulations and real-world experiments with our proposed methods... We consider 20 distinct prompts and set the cache size to be 10... For the next-token prediction task, we run the offline algorithm with two models: OPT-1.3B and OPT-13B... We fine-tune a BERT base model with 2000 samples as the model switcher by predicting whether the small model can give the correct result and achieve 80.2% accuracy. We work with 100 unseen distinct prompts in the offline setting with total queries 10000 and cache size 40. |
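
The pseudocode row above refers to Algorithm 1 (caching in online learning) and Algorithm 2 (joint design of caching and model multiplexing). The following is a minimal Python sketch of such a joint serving loop, assuming a plain LFU-style eviction rule rather than the paper's cost-aware variant; `small_model`, `large_model`, and `switcher` are hypothetical callables standing in for OPT-1.3B, OPT-13B, and the fine-tuned BERT switcher.

```python
from collections import defaultdict

CACHE_SIZE = 40  # cache size used in the paper's offline experiments

def serve(queries, small_model, large_model, switcher):
    """Simplified joint caching + model-multiplexing loop (illustrative only)."""
    cache = {}               # prompt -> cached response
    freq = defaultdict(int)  # empirical frequency of each prompt

    for prompt in queries:
        freq[prompt] += 1

        # Cache hit: serve the stored response at (near) zero inference cost.
        if prompt in cache:
            yield cache[prompt]
            continue

        # Cache miss: the switcher decides whether the small model suffices.
        if switcher(prompt):
            response = small_model(prompt)   # e.g., OPT-1.3B
        else:
            response = large_model(prompt)   # e.g., OPT-13B

        # Evict the least-frequently-seen prompt when the cache is full.
        if len(cache) >= CACHE_SIZE:
            del cache[min(cache, key=freq.__getitem__)]
        cache[prompt] = response
        yield response
```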
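
The experiment setup fine-tunes a BERT-base switcher to predict whether the small model can answer a prompt correctly. The sketch below shows one way such a switcher could be queried at inference time with the Hugging Face `transformers` text-classification pipeline; the checkpoint path and label names are hypothetical and depend on how the classifier was fine-tuned.

```python
from transformers import pipeline

# Hypothetical checkpoint: a BERT-base classifier fine-tuned (on ~2000 samples)
# to predict whether the small model answers the prompt correctly.
switcher = pipeline("text-classification",
                    model="path/to/finetuned-bert-switcher")

def route(prompt: str) -> str:
    """Pick the model that should serve this prompt."""
    pred = switcher(prompt, truncation=True)[0]
    # Label names depend on the fine-tuning setup; here "LABEL_1" is assumed
    # to mean "the small model is expected to be correct".
    return "facebook/opt-1.3b" if pred["label"] == "LABEL_1" else "facebook/opt-13b"

print(route("The capital of France is"))
```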