HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face
Authors: Yongliang Shen, Kaitao Song, Xu Tan, Dongsheng Li, Weiming Lu, Yueting Zhuang
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | 4 Experiments. 4.3 Quantitative Evaluation. Dataset To conduct our evaluation, we invite some annotators to submit some requests. We collect these data as the evaluation dataset. We use GPT-4 to generate task planning as the pseudo labels, which cover single, sequential, and graph tasks. Table 3: Evaluation for the single task. Acc and Pre represent Accuracy and Precision. |
| Researcher Affiliation | Collaboration | Yongliang Shen1,2, Kaitao Song2, Xu Tan2, Dongsheng Li2, Weiming Lu1, Yueting Zhuang1; Zhejiang University1, Microsoft Research Asia2; {syl, luwm, yzhuang}@zju.edu.cn, {kaitaosong, xuta, dongsli}@microsoft.com |
| Pseudocode | No | The paper describes the four stages of HuggingGPT's workflow in prose and visually in Figure 2, but does not provide any structured pseudocode or an algorithm block. |
| Open Source Code | Yes | https://github.com/microsoft/JARVIS |
| Open Datasets | No | Dataset To conduct our evaluation, we invite some annotators to submit some requests. We collect these data as the evaluation dataset. We use GPT-4 to generate task planning as the pseudo labels, which cover single, sequential, and graph tasks. Furthermore, we invite some expert annotators to label task planning for some complex requests (46 examples) as a high-quality human-annotated dataset. We also plan to improve the quality and quantity of this dataset to further assist in evaluating the LLM's planning capabilities, which remains a future work. More details about this dataset are in Appendix A.2. |
| Dataset Splits | No | The paper collects evaluation data and categorizes it (single task, sequential, graph), but does not explicitly state training, validation, and test splits (e.g., percentages or counts) for these datasets, nor does it refer to predefined standard splits for reproducibility. |
| Hardware Specification | No | The paper states that GPT models are "publicly accessible through the OpenAI API" and mentions using "hybrid inference endpoints" including "local inference endpoints" and "cloud service (e.g., Hugging Face)". However, it does not provide specific hardware details such as GPU/CPU models, processor types, or memory amounts for these endpoints or their local setup. |
| Software Dependencies | No | The paper mentions using specific LLMs (gpt-3.5-turbo, text-davinci-003, gpt-4) via the OpenAI API and various models from Hugging Face (e.g., facebook/detr-resnet-101). However, it does not provide specific version numbers for ancillary software components, libraries, or frameworks (e.g., Python version, PyTorch/TensorFlow versions, Hugging Face Transformers library version) that would be needed for reproducibility. |
| Experiment Setup | Yes | To enable more stable outputs of LLM, we set the decoding temperature to 0. In addition, to regulate the LLM output to satisfy the expected format (e.g., JSON format), we set the logit_bias to 0.2 on the format constraints (e.g., `{` and `}`). We provide detailed prompts designed for the task planning, model selection, and response generation stages in Table 1, where {{variable}} indicates the slot which needs to be populated with the corresponding text before being fed into the LLM. |
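Although the paper provides no pseudocode, the four stages it describes (task planning, model selection, task execution, response generation) can be sketched as a minimal control loop. This is an illustrative reconstruction under stated assumptions, not the authors' implementation; the stub classes, method names, and the single image-to-text task below are hypothetical stand-ins.

```python
# Hedged sketch of HuggingGPT's four-stage workflow, reconstructed from the
# paper's prose and Figure 2. StubLLM/StubModel are illustrative placeholders
# for the controller LLM and a Hugging Face expert model.

class StubLLM:
    def plan(self, request):
        # Stage 1: task planning -- parse the user request into structured tasks.
        return [{"id": 0, "task": "image-to-text", "args": {"image": "a.jpg"}}]

    def select_model(self, task, candidates):
        # Stage 2: model selection -- pick an expert model from hub candidates.
        return candidates[0]

    def generate_response(self, request, tasks, results):
        # Stage 4: response generation -- summarize all execution results.
        return f"Completed {len(tasks)} task(s): {results}"

class StubModel:
    def __init__(self, name):
        self.name = name

    def run(self, args, prior_results):
        # Stage 3: task execution -- run on a local or cloud inference endpoint.
        return f"{self.name} output for {args}"

def hugginggpt(request, llm, candidates):
    tasks = llm.plan(request)
    results = {}
    for t in tasks:
        model = llm.select_model(t, candidates)
        results[t["id"]] = model.run(t["args"], results)
    return llm.generate_response(request, tasks, results)

reply = hugginggpt("Describe a.jpg", StubLLM(), [StubModel("facebook/detr-resnet-101")])
print(reply)
```

The loop passes earlier results into each `run` call because the paper's sequential and graph tasks allow later tasks to depend on earlier outputs.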
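The decoding settings quoted in the Experiment Setup row map directly onto OpenAI Chat Completions request parameters. A minimal sketch, assuming those parameter names; the token IDs used for `{` and `}` are hypothetical placeholders, since real IDs depend on the chosen model's tokenizer (e.g., resolvable with tiktoken):

```python
# Sketch of the paper's decoding configuration: temperature 0 for stable
# outputs, and a logit_bias of 0.2 on JSON format tokens ("{" and "}").
# The token IDs below are placeholders, not real tokenizer IDs.

FORMAT_TOKEN_BIAS = {"123": 0.2, "456": 0.2}  # hypothetical IDs for "{" and "}"

def build_planning_request(prompt: str, model: str = "gpt-3.5-turbo") -> dict:
    """Assemble Chat Completions parameters for the task-planning stage."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0,                 # deterministic decoding for stable plans
        "logit_bias": FORMAT_TOKEN_BIAS,  # nudge output toward JSON braces
    }

req = build_planning_request("Parse the user request into a JSON task list: ...")
print(req["temperature"], sorted(req["logit_bias"].values()))
```

The same parameter dict could be unpacked into an API client call; it is kept as plain data here so the configuration is inspectable without network access.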