ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs
Authors: Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong, Xiangru Tang, Bill Qian, Sihan Zhao, Lauren Hong, Runchu Tian, Ruobing Xie, Jie Zhou, Mark Gerstein, Dahai Li, Zhiyuan Liu, Maosong Sun
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments show that ToolLLaMA demonstrates a remarkable ability to execute complex instructions and generalize to unseen APIs, and exhibits comparable performance to ChatGPT. |
| Researcher Affiliation | Collaboration | 1 Tsinghua University; 2 ModelBest Inc.; 3 Renmin University of China; 4 Yale University; 5 WeChat AI, Tencent Inc.; 6 Zhihu Inc. |
| Pseudocode | No | The paper describes algorithms in text and through conceptual figures, but it does not contain a structured pseudocode or algorithm block explicitly labeled as 'Pseudocode' or 'Algorithm'. |
| Open Source Code | Yes | The codes, trained models, and demo are publicly available at https://github.com/OpenBMB/ToolBench. |
| Open Datasets | Yes | We first present ToolBench, an instruction-tuning dataset for tool use, which is constructed automatically using ChatGPT... The codes, trained models, and demo are publicly available at https://github.com/OpenBMB/ToolBench. |
| Dataset Splits | No | We train the model for two epochs and select the model checkpoint with the best performance on the development set and then evaluate it on the test set. However, the paper does not provide specific percentages or sample counts for the training, validation, or test splits. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or memory amounts used for running its experiments. |
| Software Dependencies | No | The paper mentions several models and APIs (e.g., LLaMA-2 7B model, ChatGPT (gpt-3.5-turbo-16k), Sentence-BERT, BERT-BASE), but it does not provide specific version numbers for the underlying software libraries, frameworks, or key dependencies required for reproducibility. |
| Experiment Setup | Yes | For the training hyperparameters, we use a learning rate of 5 × 10⁻⁵, a warmup ratio of 4 × 10⁻², a total batch size of 64, a maximum sequence length of 8192, and use a position interpolation ratio of 2. We train the model for two epochs and select the model checkpoint with the best performance on the development set and then evaluate it on the test set. (See the configuration sketch after this table.) |
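
The reported training hyperparameters can be gathered into a single fine-tuning configuration. The sketch below is a minimal, hypothetical rendering using Hugging Face transformers, not the authors' released training script; the mapping of the position interpolation ratio to linear RoPE scaling, the per-device/accumulation split of the total batch size of 64, and the use of bf16 are assumptions rather than details stated in the paper.

```python
# Hypothetical sketch of the ToolLLaMA fine-tuning setup described in the paper.
# Only the numeric hyperparameters come from the paper; everything else is assumed.
from transformers import LlamaConfig, TrainingArguments

# Assumption: the "position interpolation ratio of 2" corresponds to
# linear RoPE scaling with factor 2.0 on the LLaMA-2 7B base model.
config = LlamaConfig.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    max_position_embeddings=8192,                     # maximum sequence length (paper)
    rope_scaling={"type": "linear", "factor": 2.0},   # position interpolation ratio 2 (paper)
)

training_args = TrainingArguments(
    output_dir="toolllama-7b",
    learning_rate=5e-5,               # 5 x 10^-5 (paper)
    warmup_ratio=4e-2,                # 4 x 10^-2 (paper)
    per_device_train_batch_size=8,    # assumption: 8 devices x 8 = total batch size 64 (paper)
    gradient_accumulation_steps=1,    # assumption
    num_train_epochs=2,               # two epochs; best dev-set checkpoint is kept (paper)
    bf16=True,                        # assumption
)
```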