An LLM Compiler for Parallel Function Calling

Authors: Sehoon Kim, Suhong Moon, Ryan Tabrizi, Nicholas Lee, Michael W. Mahoney, Kurt Keutzer, Amir Gholami

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We have benchmarked LLMCompiler on a range of tasks with different patterns of function calling. We observe consistent latency speedup of up to 3.7×, cost savings of up to 6.7×, and accuracy improvement of up to 9% compared to ReAct. Our code is available at https://github.com/SqueezeAILab/LLMCompiler.
Researcher Affiliation | Academia | 1UC Berkeley 2ICSI 3LBNL. Correspondence to: Amir Gholami <amirgh@berkeley.edu>.
Pseudocode | No (illustrative sketch below the table) | The paper describes the system components and their interactions but does not include explicit pseudocode or algorithm blocks.
Open Source Code | Yes | Our code is available at https://github.com/SqueezeAILab/LLMCompiler.
Open Datasets | Yes (loading sketch below the table) | We evaluate LLMCompiler on embarrassingly parallel patterns using HotpotQA (Yang et al., 2018) and Movie Recommendation (Srivastava et al., 2022), where we observe 1.80×/3.74× speedup and 3.37×/6.73× cost reduction compared to ReAct (Sec. 5.1).
Dataset Splits | No | The paper mentions using specific datasets for evaluation (e.g., the 'comparison dev set' for HotpotQA), but does not explicitly provide training/validation/test splits with percentages or sample counts that would allow the data partitioning to be reproduced.
Hardware Specification | Yes (serving sketch below the table) | For an open-source model, we use LLaMA-2 (Touvron et al., 2023), which was hosted on 2 A100-80GB GPUs using the vLLM (Kwon et al., 2023) framework.
Software Dependencies | Yes | We use OpenAI's GPT models as closed-source models, in particular, gpt-3.5-turbo (1106 release) for HotpotQA and Movie Recommendation, gpt-4-turbo (1106 release) for ParallelQA, and gpt-4 (0613 release) for Game of 24. For an open-source model, we use LLaMA-2 (Touvron et al., 2023), which was hosted on 2 A100-80GB GPUs using the vLLM (Kwon et al., 2023) framework.
Experiment Setup | Yes (API-call sketch below the table) | All the runs have been carried out with zero temperature, except for the thought proposer and state evaluator in the Game of 24 evaluation, where the temperature is set to 0.7. Since OpenAI has randomness in outputs even with temperature 0, we have conducted 3 runs, and we reported the average accuracy. Across ReAct, OpenAI parallel function calling, and LLMCompiler, we perform 3, 1, and 5-shot learning for HotpotQA, Movie Recommendation, and ParallelQA, respectively; the same examples across different methods were used to ensure a fair comparison. For the Game of 24, we use 2 in-context examples for the Planner. We use the same instruction prompts across different methods for a fair comparison, except for ReAct in Sec. 5.1 with additional ReAct-specific prompts. For the WebShop experiment, we use gpt-4-0613 with an 8k context window and the gpt-3.5-turbo model with a 16k context window.
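
As noted in the Pseudocode row, the paper describes its components (an LLM Planner that emits a dependency graph of function calls, a Task Fetching Unit that dispatches tasks once their dependencies resolve, and Executors that run them in parallel) in prose rather than as an algorithm block. The sketch below is an illustrative reconstruction of that dataflow, not the authors' code: the Task record, the "$k" placeholder convention, and the helper names are assumptions.

```python
import asyncio
from dataclasses import dataclass, field

# Hypothetical plan entry; the real LLMCompiler plan format differs in detail.
@dataclass
class Task:
    idx: int
    tool: str                               # name of the function/tool to call
    args: dict                              # arguments, possibly referencing earlier outputs as "$1", "$2", ...
    deps: set = field(default_factory=set)  # indices of tasks this task depends on

def resolve(args, results):
    # Substitute "$k" placeholders with the output of task k (assumed convention).
    return {k: results[int(v[1:])] if isinstance(v, str) and v.startswith("$") else v
            for k, v in args.items()}

async def run_plan(tasks, tools):
    """Execute a planner-produced DAG of tool calls, launching every task whose
    dependencies have already finished; independent tasks run concurrently."""
    results, pending = {}, {t.idx: t for t in tasks}
    while pending:
        ready = [t for t in pending.values() if t.deps <= results.keys()]
        if not ready:
            raise ValueError("plan contains a dependency cycle")
        outs = await asyncio.gather(
            *(asyncio.to_thread(tools[t.tool], **resolve(t.args, results)) for t in ready))
        for t, out in zip(ready, outs):
            results[t.idx] = out
            del pending[t.idx]
    return results
```

With a plan such as Task(1, "search", {"query": "..."}), Task(2, "search", {"query": "..."}), Task(3, "compare", {"a": "$1", "b": "$2"}, deps={1, 2}), tasks 1 and 2 would execute concurrently and task 3 would wait for both, which is where the latency savings reported above come from.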
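
For the Open Datasets and Dataset Splits rows: HotpotQA is publicly available, and the "comparison dev set" the paper refers to can plausibly be approximated by filtering the public dev split on the question-type label. The Hugging Face hosting, the "distractor" config name, and the "type" field filter below are assumptions about how to reproduce that subset, not details given in the paper.

```python
from datasets import load_dataset  # Hugging Face `datasets` library

# HotpotQA is public; "distractor" is one of its standard configurations.
hotpot_dev = load_dataset("hotpot_qa", "distractor", split="validation")

# Each HotpotQA question is labeled "comparison" or "bridge"; the paper's
# "comparison dev set" presumably keeps only the comparison-type questions.
comparison_dev = hotpot_dev.filter(lambda ex: ex["type"] == "comparison")
print(len(comparison_dev), "comparison-type dev questions")
```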
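
For the Hardware Specification and Software Dependencies rows: the paper states that LLaMA-2 was served on 2 A100-80GB GPUs with vLLM but does not give the serving configuration. Below is a minimal sketch using vLLM's offline Python API; the exact checkpoint, prompt, and decoding settings are assumptions.

```python
from vllm import LLM, SamplingParams

# tensor_parallel_size=2 shards the weights across the two A100-80GB GPUs.
llm = LLM(model="meta-llama/Llama-2-70b-chat-hf", tensor_parallel_size=2)

# Greedy decoding to mirror the zero-temperature runs described in the paper.
params = SamplingParams(temperature=0.0, max_tokens=512)
outputs = llm.generate(["List the tool calls needed to answer: ..."], params)
print(outputs[0].outputs[0].text)
```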
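
For the Software Dependencies and Experiment Setup rows: the dated OpenAI releases named in the paper correspond to model identifiers such as gpt-3.5-turbo-1106 and gpt-4-0613, queried at temperature 0 and averaged over 3 runs. A hedged sketch with the openai Python client follows; the identifier mapping for gpt-4-turbo, the prompt content, and the accuracy scorer are assumptions or placeholders.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

MODELS = {
    "HotpotQA": "gpt-3.5-turbo-1106",            # "gpt-3.5-turbo (1106 release)"
    "Movie Recommendation": "gpt-3.5-turbo-1106",
    "ParallelQA": "gpt-4-1106-preview",           # assumed id for "gpt-4-turbo (1106 release)"
    "Game of 24": "gpt-4-0613",                   # "gpt-4 (0613 release)"
}

def ask(model: str, prompt: str, temperature: float = 0.0) -> str:
    # Temperature 0 everywhere except the Game-of-24 proposer/evaluator (0.7).
    resp = client.chat.completions.create(
        model=model,
        temperature=temperature,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

# Even at temperature 0 the API is not fully deterministic, so the paper averages
# accuracy over 3 runs; accuracy_of() below is a hypothetical scoring helper.
# scores = [accuracy_of(ask(MODELS["HotpotQA"], question)) for _ in range(3)]
```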