InfiAgent-DABench: Evaluating Agents on Data Analysis Tasks
Authors: Xueyu Hu, Ziyu Zhao, Shuang Wei, Ziwei Chai, Qianli Ma, Guoyin Wang, Xuwu Wang, Jing Su, Jingjing Xu, Ming Zhu, Yao Cheng, Jianbo Yuan, Jiwei Li, Kun Kuang, Yang Yang, Hongxia Yang, Fei Wu
ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this paper, we introduce InfiAgent-DABench, the first benchmark specifically designed to evaluate LLM-based agents on data analysis tasks. Our extensive benchmarking of 34 LLMs uncovers the current challenges encountered in data analysis tasks. |
| Researcher Affiliation | Collaboration | 1. Department of Computer Science and Technology, Zhejiang University, Hangzhou, China; 2. Rochester Institute of Technology; 3. ByteDance Inc.; 4. Shanghai Institute for Advanced Study, Zhejiang University, Shanghai, China; 5. Shanghai AI Laboratory, Shanghai, China |
| Pseudocode | No | The paper describes methods and processes but does not include structured pseudocode or clearly labeled algorithm blocks. |
| Open Source Code | Yes | Evaluation datasets and toolkits for InfiAgent-DABench are released at https://github.com/InfiAgent/InfiAgent. |
| Open Datasets | Yes | Evaluation datasets and toolkits for InfiAgent-DABench are released at https://github.com/InfiAgent/InfiAgent. |
| Dataset Splits | Yes | We split the dataset into a validation set and a test set. The validation set is open to the public, comprising 257 questions with 52 CSV files; the rest is designated for the test set, which is kept closed to avoid data leakage. |
| Hardware Specification | No | The paper does not provide specific details on the hardware used for running its experiments, such as exact GPU/CPU models, processor types, or memory amounts. |
| Software Dependencies | No | The paper mentions software components like 'Python', 'pandas', 'sklearn', and 'numpy', but does not provide specific version numbers for these or other ancillary software dependencies required for replication. |
| Experiment Setup | Yes | Other implementation details are in Appendix P. We use accuracy as the metric, which is the proportion of questions for which all sub-questions are answered correctly. We use regular expression matching to extract the answer enclosed in @answer_name[answer] and exact match to evaluate performance. We set temperature 0.2, top-p 1.0 with nucleus sampling, and frequency penalty 0.0 for all the models in the experiments. See the evaluation sketch after the table. |
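
The extraction-and-scoring procedure quoted above is simple enough to sketch. The following is a minimal illustration, not the released InfiAgent toolkit: the exact `@answer_name[answer]` regex, the function names, and the data shapes are all assumptions based on the description in the table.

```python
import re

# Answers are assumed to appear in model output as @answer_name[answer],
# per the format described in the Experiment Setup row above.
ANSWER_PATTERN = re.compile(r"@(\w+)\[(.+?)\]")

def extract_answers(response: str) -> dict[str, str]:
    """Map each answer name found in a model response to its extracted value."""
    return {name: value.strip() for name, value in ANSWER_PATTERN.findall(response)}

def question_correct(response: str, references: dict[str, str]) -> bool:
    """A question counts as correct only if every sub-question matches exactly."""
    predicted = extract_answers(response)
    return all(predicted.get(name) == ref for name, ref in references.items())

def accuracy(responses: list[str], reference_sets: list[dict[str, str]]) -> float:
    """Proportion of questions for which all sub-questions are answered correctly."""
    correct = sum(question_correct(r, refs) for r, refs in zip(responses, reference_sets))
    return correct / len(reference_sets)

# Hypothetical usage with the sampling settings quoted above; `client.chat`
# is a placeholder, not a specific API:
# response = client.chat(prompt, temperature=0.2, top_p=1.0, frequency_penalty=0.0)
```

Exact match over a fixed answer template keeps the metric strict and automatable, which matches the paper's choice to score a question as correct only when all of its sub-questions are answered correctly.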