Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Towards Optimal Caching and Model Selection for Large Model Inference

Authors: Banghua Zhu, Ying Sheng, Lianmin Zheng, Clark Barrett, Michael Jordan, Jiantao Jiao

NeurIPS 2023 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental Empirically, simulations show that the combination of our caching and model multiplexing algorithms greatly improves over the baselines, with up to 50× improvement over the baseline when the ratio between the maximum cost and minimum cost is 100. Experiments on real datasets show a 4.3× improvement in FLOPs over the baseline when the ratio for FLOPs is 10, and a 1.8× improvement in latency when the ratio for average latency is 1.85.
Researcher Affiliation Academia Banghua Zhu, Department of EECS, UC Berkeley, EMAIL; Ying Sheng, Computer Science Department, Stanford University, EMAIL; Lianmin Zheng, Department of EECS, UC Berkeley, EMAIL; Clark Barrett, Computer Science Department, Stanford University, EMAIL; Michael I. Jordan, Department of EECS, UC Berkeley, EMAIL; Jiantao Jiao, Department of EECS, UC Berkeley, EMAIL
Pseudocode Yes Algorithm 1: Caching in Online Learning...; Algorithm 2: Joint Design of Caching and Model Multiplexing
Open Source Code Yes The code is available at https://github.com/Ying1123/llm-caching-multiplexing.
Open Datasets Yes We evaluate our algorithms on two tasks: next-token prediction on the Lambada (Paperno et al., 2016) dataset and chat assistant on the Open Assistant (Köpf et al., 2023) dataset.
Dataset Splits No The paper mentions total queries and cache size (e.g., 'total queries 10000 and cache size 40') and that a model switcher was 'fine-tune[d]... with 2000 samples,' but it does not specify explicit train/validation/test dataset splits or their proportions.
Hardware Specification No The paper does not provide specific details about the hardware used for experiments, such as GPU or CPU models.
Software Dependencies No The paper mentions models like BERT, OPT, Fast Chat-T5, and Vicuna, but it does not specify the version numbers of any software dependencies or libraries used (e.g., Python, PyTorch, TensorFlow versions).
Experiment Setup Yes We conduct both simulations and real-world experiments with our proposed methods... We consider 20 distinct prompts and set the cache size to be 10... For the next-token prediction task, we run the offline algorithm with two models: OPT-1.3B and OPT-13B... We fine-tune a BERT base model with 2000 samples as the model switcher by predicting whether the small model can give the correct result and achieve 80.2% accuracy. We work with 100 unseen distinct prompts in the offline setting with total queries 10000 and cache size 40.
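The setup above pairs a response cache with a learned model switcher that predicts whether the small model will answer correctly. As an illustrative sketch only (not the paper's Algorithms 1-2 — the class name, switcher signature, and LRU eviction stand-in for the paper's cost-aware policy are all assumptions), the joint design might look like:

```python
from collections import OrderedDict

class CachedMultiplexer:
    """Hypothetical sketch: cache hits skip inference entirely; on a miss,
    a switcher score routes the query to a cheap or an expensive model."""

    def __init__(self, cache_size, small_model, large_model, switcher, threshold=0.5):
        self.cache = OrderedDict()   # prompt -> response, in LRU order
        self.cache_size = cache_size
        self.small_model = small_model
        self.large_model = large_model
        self.switcher = switcher     # prompt -> P(small model answers correctly)
        self.threshold = threshold
        self.stats = {"hits": 0, "small": 0, "large": 0}

    def query(self, prompt):
        if prompt in self.cache:
            self.cache.move_to_end(prompt)   # refresh LRU position
            self.stats["hits"] += 1
            return self.cache[prompt]
        # Route to the small model only when the switcher is confident.
        if self.switcher(prompt) >= self.threshold:
            response = self.small_model(prompt)
            self.stats["small"] += 1
        else:
            response = self.large_model(prompt)
            self.stats["large"] += 1
        self.cache[prompt] = response
        if len(self.cache) > self.cache_size:
            self.cache.popitem(last=False)   # evict least recently used
        return response
```

For instance, with stub models and a length-based switcher, a repeated prompt becomes a cache hit and long prompts fall through to the large model; the paper's actual eviction rule weighs per-query cost and frequency rather than recency alone.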