Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

STRATUS: A Multi-agent System for Autonomous Reliability Engineering of Modern Clouds

Authors: Yinfang Chen, Jiaqi Pan, Jackson Clark, Yiming Su, Noah Zheutlin, Bhavya Bhavya, Rohan R. Arora, Yu Deng, Saurabh Jha, Tianyin Xu

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We evaluate STRATUS on two SRE benchmark suites, AIOps Lab [19] and ITBench [35]. STRATUS significantly outperforms state-of-the-art SRE agents in terms of success rate of failure mitigation problems in AIOps Lab and ITBench, by at least 1.5 times across various models (GPT-4o, GPT4o-mini, and Llama3). We measure agentic systems using metrics: Success Rate (the percentage of problems being successfully solved), average time (in seconds), the number of steps to solve a problem, and monetary cost in US dollars (calculated based on token consumptions).
Researcher Affiliation	Collaboration	University of Illinois Urbana-Champaign; IBM Research; IIIS, Tsinghua University
Pseudocode	No	The paper describes the system architecture and control flow using a state machine diagram (Figure 2) and textual descriptions in Sections 3 and 4, but does not include a formal pseudocode block or algorithm.
Open Source Code	Yes	The artifacts of STRATUS can be found at https://github.com/xlab-uiuc/stratus.
Open Datasets	Yes	We evaluate STRATUS on two SRE benchmark suites, AIOps Lab [19] and ITBench [35]. Both benchmarks provide a live, arena-like environment, where AI agents are asked to resolve problems; each problem encodes a failure in emulated cloud systems [3,25,84], as exemplified by Figure 4.
Dataset Splits	Yes	We evaluate STRATUS on two SRE benchmark suites, AIOps Lab [19] and ITBench [35]. ...Concretely, it successfully solves 69.2% (9/13) and 50.0% (9/18) mitigation problems in AIOps Lab and ITBench, respectively.
Hardware Specification	No	The paper mentions running experiments, profiling memory usage, and execution time, but does not specify the exact hardware (e.g., specific GPU/CPU models, memory amounts, or cloud instances) used for these experiments in the main text or appendix.
Software Dependencies	Yes	STRATUS is implemented using the Crew AI multi-agent framework [1]. ...We currently use off-the-shelf LLMs such as GPT and Llama models without fine-tuning. ...We fuel these agents with different LLMs, including GPT-4o, GPT-4o-mini, and Llama 3.3. ...4o refers to GPT-4o (gpt-4o-2024-08-06); mini refers to GPT-4o-mini (gpt-4o-mini-2024-07-18); llama refers to Llama 3.3 (llama-3-3-70b-instruct).
Experiment Setup	Yes	The LLM temperature is set to 0 to make the results more deterministic. ...We set K to 20 based on an empirical analysis (in appendix). ...We perform a hyperparameter sweep over the temperature parameter in STRATUS (GPT-4o) using all 13 mitigation tasks from AIOps Lab. Table 7 shows the results. We can see that the temperature of 0.0 yields the best results.
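
The Success Rate figures quoted in the table (the percentage of problems successfully solved) follow directly from the solved and total counts given in the Dataset Splits row. A minimal sketch of that arithmetic is shown below; the function name `success_rate` and the `results` dictionary are illustrative assumptions and are not taken from the STRATUS artifacts.

```python
def success_rate(solved: int, total: int) -> float:
    """Percentage of problems solved, rounded to one decimal place.

    Hypothetical helper for illustration; not part of the STRATUS code base.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    return round(100.0 * solved / total, 1)


# (solved, total) mitigation problems, as quoted in the Dataset Splits row.
results = {"AIOps Lab": (9, 13), "ITBench": (9, 18)}

for benchmark, (solved, total) in results.items():
    print(f"{benchmark}: {success_rate(solved, total)}% ({solved}/{total})")
```

With the quoted counts, this gives 69.2% for AIOps Lab and 50.0% for ITBench, matching the reported values.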