Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
DSBench: How Far Are Data Science Agents from Becoming Data Science Experts?
Authors: Liqiang Jing, Zhehui Huang, Xiaoyang Wang, Wenlin Yao, Wenhao Yu, Kaixin Ma, Hongming Zhang, Xinya Du, Dong Yu
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To bridge this gap, we introduce DSBench, a comprehensive benchmark designed to evaluate data science agents with realistic tasks. This benchmark includes 466 data analysis tasks and 74 data modeling tasks, sourced from Model Off and Kaggle competitions. DSBench offers a realistic setting by encompassing long contexts, multimodal task backgrounds, reasoning with large data files and multi-table structures, and performing end-to-end data modeling tasks. Our evaluation of state-of-the-art LLMs, LVLMs, and agents shows that they struggle with most tasks, with the best agent solving only 34.12% of data analysis tasks and achieving a 34.74% Relative Performance Gap (RPG). |
| Researcher Affiliation | Collaboration | Liqiang Jing1,2 Zhehui Huang2,3 Xiaoyang Wang2 Wenlin Yao2 Wenhao Yu2 Kaixin Ma2 Hongming Zhang2 Xinya Du1 Dong Yu2 1University of Texas at Dallas 2Tencent AI Lab, Seattle 3University of Southern California |
| Pseudocode | No | The paper includes Python code snippets in Appendix I.1 for running tasks and demonstrating agent interactions, but it does not contain any clearly labeled 'Pseudocode' or 'Algorithm' blocks for its proposed methodology. |
| Open Source Code | Yes | Our contribution can be summarized as follows: (1) We construct a data science benchmark, DSBench, which consists of 466 data analysis tasks and 74 data modeling tasks; (2) To comprehensively evaluate existing approaches for the data modeling tasks, we propose the Relative Performance Gap metric that can normalize various evaluation metrics for data modeling; (3) We evaluate representative state-of-the-art LLMs, LVLMs, and agents including the most recent GPT-4o, Claude, and Gemini models, and find that our benchmark is challenging for most of the existing approaches. We released all our data and code on GitHub: https://github.com/LiqiangJing/DSBench. |
| Open Datasets | Yes | This benchmark includes 466 data analysis tasks and 74 data modeling tasks, sourced from Model Off and Kaggle competitions. [...] In total, we collected 466 data analysis tasks from Modeloff and 74 data modeling tasks from Kaggle. [...] We released all our data and code on GitHub: https://github.com/LiqiangJing/DSBench. |
| Dataset Splits | Yes | Since the testing set in the Kaggle competition is inaccessible, we split the original training set into a training set and a testing set at an 8:2 ratio for evaluation. In this way, we can directly measure the performance of the solution devised by a data science agent, avoiding submitting the solution to the Kaggle website. |
| Hardware Specification | Yes | All the open-source models are run on a server with 4 NVIDIA A100 GPUs. |
| Software Dependencies | No | The paper specifies versions for the large language models evaluated (e.g., LLaVA-1.5-13b, LLaMA 3, gpt-3.5-turbo-0125, etc.) but does not provide version numbers for ancillary software libraries like Pandas or Scikit-learn, which are used in the provided code examples within the appendices. |
| Experiment Setup | Yes | S( ) is the semantics comparison function that is implemented by an LLM with the prompt in Appendix C. [...] We simply use greedy decoding for all models. [...] The prompt for data analysis tasks is shown as follows. [...] The prompt for data modeling tasks is shown as follows. |
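The Relative Performance Gap (RPG) mentioned in the abstract and contributions is the paper's device for putting heterogeneous Kaggle metrics (accuracy, RMSE, AUC, etc.) on one comparable scale. The exact definition lives in the paper itself; the sketch below is only a hypothetical reconstruction, assuming a min-max-style normalization of the agent's score between a baseline score and the best known score, which is one plausible reading of "normalize various evaluation metrics".

```python
def relative_performance_gap(agent_score: float,
                             baseline_score: float,
                             best_score: float) -> float:
    """Fraction of the baseline-to-best gap closed by the agent.

    Hypothetical reconstruction of DSBench's RPG: the precise formula
    is defined in the paper; this only illustrates how a min-max style
    normalization makes different Kaggle metrics share one scale.
    Assumes higher raw scores are better for the metric in question.
    """
    gap = best_score - baseline_score
    if gap == 0:
        return 1.0  # the baseline already matches the best score
    return (agent_score - baseline_score) / gap


# Illustrative values only (not from the paper): an agent scoring 0.70
# on a task where the baseline is 0.50 and the best score is 1.00 has
# closed 40% of the gap.
rpg = relative_performance_gap(0.70, 0.50, 1.00)
```

Under this reading, averaging RPG across the 74 modeling tasks gives a single normalized number regardless of each competition's native metric.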
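The 8:2 re-split described under "Dataset Splits" (needed because Kaggle's official test labels are inaccessible) can be sketched as follows. This is not the authors' code; the shuffling strategy and seed are assumptions made purely for illustration.

```python
import random


def split_train_test(rows, test_ratio=0.2, seed=42):
    """Re-split a Kaggle training set into new train/test partitions.

    Hypothetical sketch of the 8:2 split described in the paper;
    the fixed seed and uniform shuffle are assumptions, not the
    authors' documented procedure.
    """
    rows = list(rows)
    random.Random(seed).shuffle(rows)  # deterministic shuffle for reproducibility
    n_test = int(len(rows) * test_ratio)
    return rows[n_test:], rows[:n_test]  # (train, held-out test)


train, test = split_train_test(range(100))  # 80 train rows, 20 test rows
```

Holding out labels this way lets the benchmark score an agent's submission locally instead of routing it through the Kaggle leaderboard.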