DS-Agent: Automated Data Science by Empowering Large Language Models with Case-Based Reasoning

Authors: Siyuan Guo, Cheng Deng, Ying Wen, Hechang Chen, Yi Chang, Jun Wang

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirically, DS-Agent with GPT-4 achieves 100% success rate in the development stage, while attaining 36% improvement on average one pass rate across alternative LLMs in the deployment stage. In both stages, DS-Agent achieves the best rank in performance, costing $1.60 and $0.13 per run with GPT-4, respectively.
Researcher Affiliation | Academia | (1) School of Artificial Intelligence, Jilin University; (2) Engineering Research Center of Knowledge-Driven Human-Machine Intelligence, Jilin University; (3) International Center of Future Science, Jilin University; (4) Shanghai Jiao Tong University; (5) University College London.
Pseudocode | Yes | We summarize the pseudo-code of the automatic pipeline in Algorithm 1. Overall, DS-Agent benefits from the CBR paradigm in two aspects. [...] Secondly, the Revise loop within CBR allows DS-Agent to utilize the execution feedback from the last iteration to guide the case retrieval and to revise the experiment plan via case reuse.
Open Source Code | Yes | Our data and code are open-sourced at https://github.com/guosyjlu/DS-Agent.
Open Datasets | Yes | Our data and code are open-sourced at https://github.com/guosyjlu/DS-Agent. We select 30 data science tasks with three data modalities, including text, time series and tabular data, and two fundamental task types of regression and classification. These diverse datasets were sourced from a variety of platforms. For each dataset, we write natural language task description, and split them into training set, validation set and the test set. The detailed dataset description is presented in Table 5.
Dataset Splits | Yes | For each dataset, we write natural language task description, and split them into training set, validation set and the test set.
Hardware Specification | No | The paper mentions using GPT-3.5 and GPT-4 via API calls and refers to an execution environment with 'a NVIDIA GPU card with 24 GB memory' in the prompt design section. However, it does not specify the exact hardware (CPU, GPU models, or memory) used by the authors to run their own experiments.
Software Dependencies | No | For GPT-3.5 and GPT-4, we use the gpt-3.5-turbo-16k and gpt-4-0613 models via the OpenAI API. For the open-source LLM, we utilize Mixtral-8x7b-Instruct (Jiang et al., 2024). We utilize llm-embedder (Zhang et al., 2023b) as the pretrained embedding language model. While the example code lists packages such as 'transformers', 'torch', and 'sklearn', the paper does not explicitly state version numbers for these software dependencies in the main text.
Experiment Setup | Yes | In the development stage, we use the decoding strategy with temperature T = 0.5, while we adjust it to T = 0.7 in the deployment stage to enhance the diversity of generation. We utilize llm-embedder (Zhang et al., 2023b) as the pretrained embedding language model. For DS-Agent in the development stage, we set the iteration times T = 5, the number of retrieved cases k = 5, and the number of debugging attempts ndebug = 5.
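The Revise loop quoted in the Pseudocode row (retrieve cases, reuse them to draft a plan, execute, and revise using execution feedback) can be sketched as follows. This is a toy illustration, not the authors' Algorithm 1: `retrieve_cases`, `reuse_plan`, and the stubbed executor are hypothetical stand-ins for DS-Agent's LLM-driven components.

```python
# Hedged sketch of a case-based reasoning (CBR) development loop.
# All components below are toy stubs standing in for DS-Agent's
# retriever, LLM-based planner, and code executor.

def retrieve_cases(case_bank, query, k=5):
    """Rank stored cases by a toy relevance score (keyword overlap)."""
    def score(case):
        return len(set(case.split()) & set(query.split()))
    return sorted(case_bank, key=score, reverse=True)[:k]

def reuse_plan(cases, feedback):
    """Draft an experiment plan from retrieved cases (stand-in for an LLM call)."""
    plan = "plan based on: " + "; ".join(cases)
    if feedback:
        plan += " | revised for: " + feedback  # revision via execution feedback
    return plan

def execute(plan, step):
    """Stubbed executor: fails twice, then succeeds, to exercise the loop."""
    success = step >= 2
    feedback = "" if success else f"error at step {step}"
    return success, feedback

def cbr_loop(case_bank, task, max_iters=5):
    """Retrieve -> reuse -> execute -> revise, up to max_iters iterations."""
    feedback = ""
    for step in range(max_iters):
        query = task + " " + feedback        # feedback also guides retrieval
        cases = retrieve_cases(case_bank, query)
        plan = reuse_plan(cases, feedback)   # case reuse + revision
        success, feedback = execute(plan, step)
        if success:
            return plan, step
    return None, max_iters
```

The loop mirrors the quoted claim: execution feedback from the previous iteration shapes both the retrieval query and the revised plan.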
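The per-dataset split into training, validation, and test sets mentioned in the Dataset Splits row can be sketched minimally as below; the 80/10/10 ratios and the fixed seed are illustrative assumptions, since the paper's quoted text does not give the actual proportions.

```python
import random

def split_dataset(examples, train_frac=0.8, val_frac=0.1, seed=0):
    """Shuffle and split examples into train/validation/test partitions.
    The 80/10/10 ratios are assumptions for illustration only."""
    rng = random.Random(seed)       # fixed seed for a reproducible split
    shuffled = examples[:]
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test
```

Fixing the seed keeps the split reproducible across runs, which matters for the kind of reproducibility audit this checklist performs.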
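The case-retrieval step with k = 5 from the Experiment Setup row can be sketched with plain cosine similarity over embedding vectors. DS-Agent uses llm-embedder to produce the embeddings; here they are replaced by arbitrary toy vectors, and the function names are illustrative, not the paper's API.

```python
import math

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def top_k_cases(query_vec, case_vecs, k=5):
    """Return indices of the k stored case embeddings most similar to the query."""
    ranked = sorted(range(len(case_vecs)),
                    key=lambda i: cosine(query_vec, case_vecs[i]),
                    reverse=True)
    return ranked[:k]
```

In practice the query and case vectors would come from the same pretrained embedding model so that their similarities are meaningful.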