HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face

Authors: Yongliang Shen, Kaitao Song, Xu Tan, Dongsheng Li, Weiming Lu, Yueting Zhuang

NeurIPS 2023

Reproducibility assessment. Each item below lists the reproducibility variable, the result, and the supporting LLM response.
Research Type: Experimental
LLM Response: "4 Experiments. 4.3 Quantitative Evaluation. Dataset: To conduct our evaluation, we invite some annotators to submit some requests. We collect these data as the evaluation dataset. We use GPT-4 to generate task planning as the pseudo labels, which cover single, sequential, and graph tasks. Table 3: Evaluation for the single task. Acc and Pre represent Accuracy and Precision."

Researcher Affiliation: Collaboration
LLM Response: "Yongliang Shen (1,2), Kaitao Song (2), Xu Tan (2), Dongsheng Li (2), Weiming Lu (1), Yueting Zhuang (1); 1 Zhejiang University, 2 Microsoft Research Asia. {syl, luwm, yzhuang}@zju.edu.cn, {kaitaosong, xuta, dongsli}@microsoft.com"

Pseudocode: No
LLM Response: The paper describes the four stages of HuggingGPT's workflow in prose and visually in Figure 2, but does not provide any structured pseudocode or an algorithm block.

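Since the paper itself provides no pseudocode, a minimal illustrative sketch of the four stages it describes (task planning, model selection, task execution, response generation) might look like the following. All function names, signatures, and the stubbed LLM interface are hypothetical conveniences for this sketch, not code from the paper or its repository.

```python
# Hypothetical sketch of HuggingGPT's four-stage workflow. The LLM is
# modeled as a callable that maps a prompt to a response; real usage
# would call an OpenAI model and real inference endpoints instead.

def task_planning(llm, user_request):
    """Stage 1: ask the LLM to parse the request into a task list
    (each task with a type, id, dependencies, and arguments)."""
    return llm(f"Parse into tasks: {user_request}")

def model_selection(llm, task, candidate_models):
    """Stage 2: ask the LLM to pick a Hugging Face model for the task
    from a candidate list (e.g., models ranked by downloads)."""
    return llm(f"Choose a model for {task['task']} from {candidate_models}")

def task_execution(model, task):
    """Stage 3: run the selected model on an inference endpoint
    (stubbed here as a result record)."""
    return {"task": task["task"], "model": model, "result": "..."}

def response_generation(llm, user_request, results):
    """Stage 4: ask the LLM to integrate all task results into a
    final natural-language answer."""
    return llm(f"Summarize {results} for: {user_request}")

def hugginggpt(llm, user_request, candidate_models):
    """Chain the four stages for a single user request."""
    tasks = task_planning(llm, user_request)
    results = []
    for task in tasks:
        model = model_selection(llm, task, candidate_models)
        results.append(task_execution(model, task))
    return response_generation(llm, user_request, results)
```

The sketch deliberately keeps the stages as separate functions, mirroring the paper's description of each stage as its own prompted LLM call or model invocation.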
Open Source Code: Yes
LLM Response: https://github.com/microsoft/JARVIS

Open Datasets: No
LLM Response: "Dataset: To conduct our evaluation, we invite some annotators to submit some requests. We collect these data as the evaluation dataset. We use GPT-4 to generate task planning as the pseudo labels, which cover single, sequential, and graph tasks. Furthermore, we invite some expert annotators to label task planning for some complex requests (46 examples) as a high-quality human-annotated dataset. We also plan to improve the quality and quantity of this dataset to further assist in evaluating the LLM's planning capabilities, which remains a future work. More details about this dataset are in Appendix A.2."

Dataset Splits: No
LLM Response: The paper collects evaluation data and categorizes it (single, sequential, and graph tasks), but does not explicitly state training, validation, and test splits (e.g., percentages or counts) for these datasets, nor does it refer to predefined standard splits for reproducibility.

Hardware Specification: No
LLM Response: The paper states that GPT models are "publicly accessible through the OpenAI API" and mentions using "hybrid inference endpoints", including "local inference endpoints" and "cloud service (e.g., Hugging Face)". However, it does not provide specific hardware details such as GPU/CPU models, processor types, or memory amounts for these endpoints or their local setup.

Software Dependencies: No
LLM Response: The paper mentions specific LLMs (gpt-3.5-turbo, text-davinci-003, gpt-4) accessed via the OpenAI API, and various models from Hugging Face (e.g., facebook/detr-resnet-101). However, it does not provide version numbers for ancillary software components, libraries, or frameworks (e.g., Python, PyTorch/TensorFlow, or the Hugging Face Transformers library) that would be needed for reproducibility.

Experiment Setup: Yes
LLM Response: "To enable more stable outputs of the LLM, we set the decoding temperature to 0. In addition, to regulate the LLM output to satisfy the expected format (e.g., JSON format), we set the logit_bias to 0.2 on the format constraints (e.g., "{" and "}"). We provide detailed prompts designed for the task planning, model selection, and response generation stages in Table 1, where {{variable}} indicates the slot which needs to be populated with the corresponding text before being fed into the LLM."
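The decoding settings quoted above can be expressed as an OpenAI Chat Completions request payload. This is a hedged sketch: logit_bias keys must be token ids, which are tokenizer-specific, and the ids 90 and 92 shown for "{" and "}" are an assumption (they hold for GPT-2-style BPE vocabularies; verify with tiktoken for the target model), not values stated in the paper.

```python
# Sketch of the paper's decoding configuration: temperature 0 for
# stable outputs, and logit_bias 0.2 on the format tokens "{" and "}"
# to nudge the model toward JSON-shaped output.

def build_request(prompt: str, model: str = "gpt-3.5-turbo") -> dict:
    # logit_bias keys are token ids, not characters. 90 and 92 are the
    # ids of "{" and "}" under GPT-2-family tokenizers (assumed here;
    # confirm with tiktoken for the specific model before use).
    format_bias = {"90": 0.2, "92": 0.2}
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0,        # deterministic-leaning decoding
        "logit_bias": format_bias,
    }
```

The resulting dict can be unpacked into the OpenAI client's chat-completions call; only the temperature and logit_bias values here come from the paper, everything else is boilerplate request structure.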