Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Measuring AI Ability to Complete Long Software Tasks

Authors: Thomas Kwa, Ben West, Joel Becker, Amy Deng, Katharyn Garcia, Max Hasin, Sami Jawhar, Megan Kinniment, Nate Rush, Sydney Von Arx, Ryan Bloom, Thomas Broadley, Haoxing Du, Brian Goodrich, Nikola Jurkovic, Luke Miles, Seraphina Nix, Tao Lin, Neev Parikh, David Rein, Lucas Jun Koba Sato, Hjalmar Wijk, Daniel Ziegler, Elizabeth Barnes, Lawrence Chan

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	To quantify the capabilities of AI systems in terms of human capabilities, we propose a new metric: 50%-task-completion time horizon, the time humans typically take to complete tasks that AI agents can complete with 50% success rate. We first timed humans with relevant domain expertise on a combination of RE-Bench, HCAST, and 66 novel shorter tasks. On these tasks, agents built using current frontier AI models such as o3 have a 50% time horizon of around 110 minutes. Furthermore, frontier AI time horizon has doubled approximately every seven months since 2019, though the trend may have accelerated since 2024. The increase in AI agents' time horizons seems to be primarily driven by greater reliability, ability to adapt to mistakes, logical reasoning, and capacity for tool use.
Researcher Affiliation	Industry	Thomas Kwa, Ben West, Joel Becker, Amy Deng, Katharyn Garcia, Max Hasin, Sami Jawhar, Megan Kinniment, Nate Rush, Sydney Von Arx, Ryan Bloom, Thomas Broadley, Haoxing Du, Brian Goodrich, Nikola Jurkovic, Luke Harold Miles, Seraphina Nix, Tao Lin, Chris Painter, Neev Parikh, David Rein, Lucas Jun Koba Sato, Hjalmar Wijk, Daniel M. Ziegler, Elizabeth Barnes, Lawrence Chan. Model Evaluation & Threat Research (METR). Equal contribution. Corresponding author, EMAIL.
Pseudocode	No	The paper describes a logistic regression model in Section 3.1 and various methodologies but does not present any explicit pseudocode or algorithm blocks with structured, code-like steps.
Open Source Code	Yes	The code to reproduce our results can be found at: https://github.com/METR/eval-analysis-public.
Open Datasets	Yes	We prototype this methodology using three datasets designed to capture skills required for research or software engineering (Section 2.1), totaling 170 tasks with a wide range of difficulty: general software agent tasks from HCAST [8], machine learning research engineering tasks from RE-Bench [2], and Software Atomic Actions (SWAA), a new suite of shorter software tasks that provides signal on pre-2023 models (Appendix B.1.3).
Dataset Splits	No	The paper states: "We perform 8 runs per agent/task pair" for evaluation. It uses a combined suite of 170 tasks (HCAST, RE-Bench, SWAA) as the evaluation set. However, it does not specify any explicit training, validation, or test splits for these tasks within their experimental setup. The models themselves were developed and trained by their respective organizations, not by the authors using these specific splits for their experiments.
Hardware Specification	Yes	The total amount of compute used included about 2,000 H100-hours for RE-Bench environments and 50,000 CPU hours for other environments on a combination of cloud and internal machines, plus roughly 50,000 H100-hour equivalents of compute used internally or from API providers for inference.
Software Dependencies	No	The paper mentions using Python and Bash commands, PyTorch, Vivaria (an open-source platform), and Anthropic/OpenAI APIs. However, it does not provide specific version numbers for any of these software components, which are necessary for reproducible dependency description.
Experiment Setup	Yes	To convert agent run data to task completion time horizon, we first convert the agent performance on each task to a binary value (success or failure). Many tasks are naturally binary, including all SWAA tasks. Continuously scored tasks are binarized via a task-specific threshold chosen to represent human performance. For HCAST, the task-specific threshold is the same target score the human baseliner tries to achieve, which we also use to filter for successful runs. RE-Bench tasks have a fixed time rating of 8 hours, so the task-specific threshold is the average score of 7-9 hour human runs. Once we have agent success rates and time ratings for each task, we fit time horizons by performing the following logistic regression: $p_{\text{success}}(\text{agent}, \text{task}) = \sigma\big((\log h_{\text{agent}} - \log t_{\text{task}}) \cdot \beta_{\text{agent}}\big)$, where $t_{\text{task}}$ is the geometric mean time of successful human baselines, and $h_{\text{agent}}$ and $\beta_{\text{agent}}$ are learned parameters, with $h_{\text{agent}}$ representing the 50% time horizon. We perform 8 runs per agent/task pair. All models were run using agent scaffolds using the Anthropic or OpenAI APIs. We used the same agent scaffolds across the evaluation suite, with no task-specific prompting or scaffolding, except for the SWAA tasks, which used a simple prompting scaffold.
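
The logistic regression quoted in the Experiment Setup row can be illustrated with a minimal sketch. This is not the authors' implementation (their code is in the repository listed under Open Source Code); it only shows how a 50% time horizon could be recovered from per-task success rates under the stated model. The function names `sigmoid` and `fit_time_horizon`, the choice of Newton's method, and the example data are all assumptions made for illustration.

```python
import math


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_time_horizon(task_minutes, success_rates, iterations=50):
    """Fit p = sigmoid((log h - log t) * beta) by Newton's method.

    task_minutes: human time rating t_task for each task (hypothetical input).
    success_rates: fraction of successful agent runs per task, in [0, 1].
    Returns (h, beta), where h is the 50% time horizon in minutes.
    """
    xs = [math.log(t) for t in task_minutes]
    # Reparameterize as p = sigmoid(a + b * log t),
    # so that a = beta * log h and b = -beta.
    a, b = 0.0, 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, success_rates):
            p = sigmoid(a + b * x)
            w = p * (1.0 - p)
            g0 += y - p          # gradient w.r.t. a
            g1 += (y - p) * x    # gradient w.r.t. b
            h00 += w             # entries of X^T W X
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system (X^T W X) * delta = gradient.
        a += (h11 * g0 - h01 * g1) / det
        b += (h00 * g1 - h01 * g0) / det
    beta = -b
    return math.exp(a / beta), beta


# Hypothetical data, not taken from the paper: human time ratings in
# minutes and agent success rates out of 8 runs per task.
task_minutes = [0.1, 1, 4, 15, 60, 240, 480]
success_rates = [8 / 8, 8 / 8, 7 / 8, 6 / 8, 4 / 8, 2 / 8, 1 / 8]
h, beta = fit_time_horizon(task_minutes, success_rates)
print(f"50% time horizon: {h:.1f} minutes, beta: {beta:.2f}")
```

In this parameterization, the fitted curve crosses 50% success exactly at t = h, which is why h can be read directly as the time horizon. A positive beta means success probability falls as the human time rating grows.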