Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

ITBench: Evaluating AI Agents across Diverse Real-World IT Automation Tasks

Authors: Saurabh Jha, Rohan R. Arora, Yuji Watanabe, Takumi Yanagawa, Yinfang Chen, Jackson Clark, Bhavya Bhavya, Mudit Verma, Harshit Kumar, Hirokuni Kitahara, Noah Zheutlin, Saki Takano, Divya Pathak, Felix George, Xinbo Wu, Bekir O. Turkkan, Gerard Vanloo, Michael Nidd, Ting Dai, Oishik Chatterjee, Pranjal Gupta, Suranjana Samanta, Pooja Aggarwal, Rong Lee, Jae-Wook Ahn, Debanjana Kar, Amit Paradkar, Yu Deng, Pratibha Moogi, Prateeti Mohapatra, Naoki Abe, Chandrasekhar Narayanaswami, Tianyin Xu, Lav R. Varshney, Ruchi Mahindru, Anca Sailer, Larisa Shwartz, Daby Sow, Nicholas C. M. Fuller, Ruchir Puri

ICML 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response

Research Type | Experimental | Our results show that agents powered by state-of-the-art models resolve only 11.4% of SRE scenarios, 25.2% of CISO scenarios, and 25.8% of FinOps scenarios (excluding anomaly detection). For FinOps-specific anomaly detection (AD) scenarios, AI agents achieve an F1 score of 0.35.

Researcher Affiliation | Collaboration | 1IBM, 2University of Illinois at Urbana-Champaign. Correspondence to: Saurabh Jha <EMAIL>.

Pseudocode | No | The paper describes methodologies using natural language, mathematical formulations (POMDP), and illustrative examples of agent trajectories and tool usage (Figures 3, 13, 14, 15, 26, 27), but does not present any formal pseudocode or algorithm blocks for the AI agents' core logic.

Open Source Code | Yes | ITBench, along with a leaderboard and sample agent implementations, is available at https://github.com/ibm/itbench. To help the greater community reproduce the results presented in this paper and build on ITBench, the authors open-source all of the resources created for this project. The source code for the interactive pipeline, context-management logic, command implementations, interface design, and everything else is available in a GitHub repository.

Open Datasets | Yes | For data-insights and anomaly-detection use cases, the authors leveraged sample FinOps data from the FinOps Foundation (foc) in FOCUS format. The dataset includes more than 5 million cost records from 107 accounts covering AWS, Microsoft Azure, Oracle, and GCP services.

Dataset Splits | No | The paper defines a set of 102 scenarios for benchmarking, which serve as the evaluation set. It describes categories such as SRE (42), CISO (50), and FinOps (10) scenarios, further categorized by complexity (Easy, Medium, Hard). However, it does not specify explicit train/test/validation splits for any dataset, as the experiments evaluate pre-trained AI agents on these predefined scenarios.

Hardware Specification | Yes | We conduct our experiments primarily on AWS EC2 instances (m4.xlarge)... For our experiments, we utilized an AWS m4.xlarge cluster configured with 1 control-plane node and 3 worker nodes. The worker nodes had 12 cores and 48 GiB of RAM, with 16 cores and 64 GiB of RAM being used in total. ITBench also supports experiments on Kind clusters... We validated this capability on a machine with the following configuration: 1 control-plane node, Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz, 12 CPU cores, and 16 GB RAM, running Red Hat Enterprise Linux.

Software Dependencies | Yes | Specifically, we employ GPT-4o (checkpoint version 2024-11-20), Llama-3.3-70B-instruct, Llama-3.1-8B-instruct, and Granite-3.1-8B-instruct for tasks that rely on natural language understanding and reasoning. For code-focused use cases, we utilize GPT-4o-mini, Llama-3.1-405B-instruct, and Mixtral-8x7B-instruct.

Experiment Setup | Yes | Table 16 (model hyper-parameters): temperature = 0, top_p = 1e-7, seed = 42, decoding_method = greedy.
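The decoding hyper-parameters from Table 16 can be packaged as a reusable request configuration. This is a minimal sketch assuming an OpenAI-style chat-completions payload; the `build_request` helper, the model identifier string, and the prompt are illustrative assumptions, not the authors' code.

```python
# Decoding hyper-parameters reported in Table 16 of the paper.
# Temperature 0 plus a near-zero top_p effectively forces greedy decoding,
# and the fixed seed gives best-effort run-to-run reproducibility.
DECODING_PARAMS = {
    "temperature": 0,
    "top_p": 1e-7,
    "seed": 42,
}

def build_request(model: str, prompt: str) -> dict:
    """Assemble a chat-completion payload (hypothetical OpenAI-style shape)
    that applies the paper's decoding settings."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        **DECODING_PARAMS,
    }

# Example payload; the model name maps the paper's "checkpoint 2024-11-20"
# onto an assumed API identifier.
request = build_request("gpt-4o-2024-11-20", "Diagnose the failing pod.")
```

Pinning all three parameters is what makes repeated benchmark runs comparable across models, since any sampling randomness is suppressed.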
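The anomaly-detection result above is reported as an F1 score (0.35), the harmonic mean of precision and recall. As a quick reference, here is a self-contained sketch of how binary F1 is computed; the label lists are made-up illustrations, not the paper's data.

```python
def f1_score(gold, pred):
    """Binary F1 score, treating label 1 as the positive (anomaly) class."""
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)  # true positives
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)  # false positives
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# Illustrative labels only: 2 TP, 1 FP, 1 FN -> precision = recall = 2/3, F1 = 2/3.
gold = [1, 1, 0, 0, 1]
pred = [1, 1, 1, 0, 0]
score = f1_score(gold, pred)
```

Because F1 balances missed anomalies (recall) against false alarms (precision), a score of 0.35 indicates the agents miss many anomalies, raise many spurious ones, or both.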