InfiAgent-DABench: Evaluating Agents on Data Analysis Tasks
Authors: Xueyu Hu, Ziyu Zhao, Shuang Wei, Ziwei Chai, Qianli Ma, Guoyin Wang, Xuwu Wang, Jing Su, Jingjing Xu, Ming Zhu, Yao Cheng, Jianbo Yuan, Jiwei Li, Kun Kuang, Yang Yang, Hongxia Yang, Fei Wu
ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this paper, we introduce InfiAgent-DABench, the first benchmark specifically designed to evaluate LLM-based agents on data analysis tasks. Our extensive benchmarking of 34 LLMs uncovers the current challenges encountered in data analysis tasks. |
| Researcher Affiliation | Collaboration | 1. Department of Computer Science and Technology, Zhejiang University, Hangzhou, China; 2. Rochester Institute of Technology; 3. ByteDance Inc.; 4. Shanghai Institute for Advanced Study, Zhejiang University, Shanghai, China; 5. Shanghai AI Laboratory, Shanghai, China |
| Pseudocode | No | The paper describes methods and processes but does not include structured pseudocode or clearly labeled algorithm blocks. |
| Open Source Code | Yes | Evaluation datasets and toolkits for InfiAgent-DABench are released at https://github.com/InfiAgent/InfiAgent. |
| Open Datasets | Yes | Evaluation datasets and toolkits for InfiAgent-DABench are released at https://github.com/InfiAgent/InfiAgent. |
| Dataset Splits | Yes | We split the dataset into a validation set and a test set. The validation set is open to the public, comprising 257 questions with 52 CSV files; the rest is designated for the test set, which is kept closed to avoid data leakage. |
| Hardware Specification | No | The paper does not provide specific details on the hardware used for running its experiments, such as exact GPU/CPU models, processor types, or memory amounts. |
| Software Dependencies | No | The paper mentions software components like 'Python', 'pandas', 'sklearn', and 'numpy', but does not provide specific version numbers for these or other ancillary software dependencies required for replication. |
| Experiment Setup | Yes | Other implementation details are in Appendix P. We use accuracy as the metric, which is the proportion of questions for which all sub-questions are answered correctly. We use regular expression matching to extract the answer enclosed in @answer_name[answer] and exact match to evaluate performance. We set temperature 0.2, top-p 1.0 with nucleus sampling, and frequency penalty 0.0 for all the models in the experiments. See the evaluation sketch after the table. |
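
The extraction-and-scoring procedure quoted above is simple enough to sketch. The following is a minimal illustration, not the released InfiAgent toolkit: the exact `@answer_name[answer]` regex, the function names, and the data shapes are all assumptions based on the description in the table.

```python
import re

# Answers are assumed to appear in model output as @answer_name[answer],
# per the format described in the Experiment Setup row above.
ANSWER_PATTERN = re.compile(r"@(\w+)\[(.+?)\]")

def extract_answers(response: str) -> dict[str, str]:
    """Map each answer name found in a model response to its extracted value."""
    return {name: value.strip() for name, value in ANSWER_PATTERN.findall(response)}

def question_correct(response: str, references: dict[str, str]) -> bool:
    """A question counts as correct only if every sub-question matches exactly."""
    predicted = extract_answers(response)
    return all(predicted.get(name) == ref for name, ref in references.items())

def accuracy(responses: list[str], reference_sets: list[dict[str, str]]) -> float:
    """Proportion of questions for which all sub-questions are answered correctly."""
    correct = sum(question_correct(r, refs) for r, refs in zip(responses, reference_sets))
    return correct / len(reference_sets)

# Hypothetical usage with the sampling settings quoted above; `client.chat`
# is a placeholder, not a specific API:
# response = client.chat(prompt, temperature=0.2, top_p=1.0, frequency_penalty=0.0)
```

Exact match over a fixed answer template keeps the metric strict and automatable, which matches the paper's choice to score a question as correct only when all of its sub-questions are answered correctly.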