ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs
Authors: Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong, Xiangru Tang, Bill Qian, Sihan Zhao, Lauren Hong, Runchu Tian, Ruobing Xie, Jie Zhou, Mark Gerstein, Dahai Li, Zhiyuan Liu, Maosong Sun
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments show that ToolLLaMA demonstrates a remarkable ability to execute complex instructions and generalize to unseen APIs, and exhibits comparable performance to ChatGPT. |
| Researcher Affiliation | Collaboration | 1 Tsinghua University; 2 ModelBest Inc.; 3 Renmin University of China; 4 Yale University; 5 WeChat AI, Tencent Inc.; 6 Zhihu Inc. |
| Pseudocode | No | The paper describes algorithms in text and through conceptual figures, but it does not contain a structured pseudocode or algorithm block explicitly labeled as 'Pseudocode' or 'Algorithm'. |
| Open Source Code | Yes | The codes, trained models, and demo are publicly available at https://github.com/OpenBMB/ToolBench. |
| Open Datasets | Yes | We first present ToolBench, an instruction-tuning dataset for tool use, which is constructed automatically using ChatGPT... The codes, trained models, and demo are publicly available at https://github.com/OpenBMB/ToolBench. |
| Dataset Splits | No | We train the model for two epochs and select the model checkpoint with the best performance on the development set and then evaluate it on the test set. However, the paper does not provide specific percentages or sample counts for the training, validation, or test splits. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or memory amounts used for running its experiments. |
| Software Dependencies | No | The paper mentions several models and APIs (e.g., LLaMA-2 7B model, ChatGPT (gpt-3.5-turbo-16k), Sentence-BERT, BERT-BASE), but it does not provide specific version numbers for the underlying software libraries, frameworks, or key dependencies required for reproducibility. |
| Experiment Setup | Yes | For the training hyperparameters, we use a learning rate of 5 × 10⁻⁵, a warmup ratio of 4 × 10⁻², a total batch size of 64, a maximum sequence length of 8192, and use a position interpolation ratio of 2. We train the model for two epochs and select the model checkpoint with the best performance on the development set and then evaluate it on the test set. (See the configuration sketch after this table.) |
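
The reported training hyperparameters can be gathered into a single fine-tuning configuration. The sketch below is a minimal, hypothetical rendering using Hugging Face transformers, not the authors' released training script; the mapping of the position interpolation ratio to linear RoPE scaling, the per-device/accumulation split of the total batch size of 64, and the use of bf16 are assumptions rather than details stated in the paper.

```python
# Hypothetical sketch of the ToolLLaMA fine-tuning setup described in the paper.
# Only the numeric hyperparameters come from the paper; everything else is assumed.
from transformers import LlamaConfig, TrainingArguments

# Assumption: the "position interpolation ratio of 2" corresponds to
# linear RoPE scaling with factor 2.0 on the LLaMA-2 7B base model.
config = LlamaConfig.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    max_position_embeddings=8192,                     # maximum sequence length (paper)
    rope_scaling={"type": "linear", "factor": 2.0},   # position interpolation ratio 2 (paper)
)

training_args = TrainingArguments(
    output_dir="toolllama-7b",
    learning_rate=5e-5,               # 5 x 10^-5 (paper)
    warmup_ratio=4e-2,                # 4 x 10^-2 (paper)
    per_device_train_batch_size=8,    # assumption: 8 devices x 8 = total batch size 64 (paper)
    gradient_accumulation_steps=1,    # assumption
    num_train_epochs=2,               # two epochs; best dev-set checkpoint is kept (paper)
    bf16=True,                        # assumption
)
```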