Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
τ-bench: A Benchmark for **T**ool-**A**gent-**U**ser Interaction in Real-World Domains
Authors: Shunyu Yao, Noah Shinn, Pedram Razavi, Karthik Narasimhan
ICLR 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments show that even state-of-the-art function calling agents (like gpt-4o) succeed on < 50% of the tasks, and are quite inconsistent (pass^8 < 25% in retail). Our findings point to the need for methods that can improve the ability of agents to act consistently and follow rules reliably. Our experiments reveal that agents built with simple LM constructs (like function calling or ReAct) perform poorly, highlighting the need for more sophisticated agent architectures. For instance, even state-of-the-art LMs like gpt-4o achieve low task success rates (pass^1) using function calling (61% on τ-retail and 35% on τ-airline). |
| Researcher Affiliation | Industry | The paper states "Work done during internship. Code and data: https://github.com/sierra-research/tau-bench." While institutional affiliations are not explicitly stated for all authors, the mention of an internship and "sierra-research" in the code repository URL suggests an industry affiliation. No explicit university or company names, or email domains, are provided for the authors in the main author list. |
| Pseudocode | No | The paper formulates tasks as POMDPs and describes components, but it does not contain a dedicated section or block labeled 'Pseudocode' or 'Algorithm', nor does it present structured steps for a method in a code-like format in the main text. Appendix B.1 contains a Python implementation of an API, but not pseudocode for the overall method. |
| Open Source Code | Yes | Code and data: https://github.com/sierra-research/tau-bench. |
| Open Datasets | Yes | Code and data: https://github.com/sierra-research/tau-bench. |
| Dataset Splits | No | The paper describes running 'at least 3 trials per task' for evaluation and introduces the 'pass^k' metric based on 'k i.i.d. task trials'. However, it does not provide traditional training, validation, or test dataset splits for a model, as it describes a benchmark for evaluating agents on a set of tasks rather than a dataset for model training. |
| Hardware Specification | No | The paper mentions using API access to models like 'OpenAI GPT API (gpt-4o, gpt-4-turbo, gpt-4-32k, gpt-3.5-turbo)', 'Anthropic Claude API', 'Google Gemini API', 'Mistral API', and 'AnyScale API'. However, it does not provide any specific hardware details such as GPU models, CPU models, or cloud computing specifications used for running the experiments; these models are accessed via APIs. |
| Software Dependencies | Yes | The paper specifies models and their versions used via APIs, such as 'OpenAI GPT API (gpt-4o, gpt-4-turbo, gpt-4-32k, gpt-3.5-turbo)', 'Anthropic Claude API (claude-3-opus, claude-3-sonnet, claude-3-haiku)', 'Google Gemini API (gemini-1.5-pro-latest, gemini-1.5-flash-latest)', 'Mistral API (mistral-large, open-mixtral-8x22b)', and 'AnyScale API (meta-llama-3-70B-instruct)'. |
| Experiment Setup | Yes | The paper states: 'In FC mode, the model's system prompt is set to be the domain policy, and at each turn, the model autonomously decides to generate a user response message or a tool call.' It also specifies 'We limit each task to at most 30 agent actions (either tool calls or user responses). For main results (Table 2), we run at least 3 trials per task. The LM temperature is 0.0 for agent and 1.0 for user.' |
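The pass^k metric referenced above (the probability that k i.i.d. trials of a task all succeed) can be estimated per task from trial outcomes with a combinatorial estimator analogous to the well-known pass@k estimator. The sketch below is an assumption about the computation, not the authors' code; the trial counts are hypothetical.

```python
from math import comb

def pass_hat_k(n: int, c: int, k: int) -> float:
    """Estimate pass^k for one task: the probability that k i.i.d.
    trials all succeed, given c successes observed out of n trials.
    Uses C(c, k) / C(n, k); math.comb returns 0 when c < k."""
    if k > n:
        raise ValueError("k must not exceed the number of trials n")
    return comb(c, k) / comb(n, k)

# Hypothetical example: 4 trials per task, per-task success counts.
success_counts = [4, 2, 0, 3]
n, k = 4, 2
score = sum(pass_hat_k(n, c, k) for c in success_counts) / len(success_counts)
```

The benchmark-level pass^k is then the mean of the per-task estimates, which is why consistency (pass^8) can be much lower than single-trial success (pass^1) for the same agent.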