Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

TAI3: Testing Agent Integrity in Interpreting User Intent

Authors: Shiwei Feng, Xiangzhe Xu, Xuan Chen, Kaiyuan Zhang, Syed Ahmed, Zian Su, Mingwei Zheng, Xiangyu Zhang

NeurIPS 2025

| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on 80 toolkit APIs demonstrate that TAI3 effectively uncovers intent integrity violations, significantly outperforming baselines in both error-exposing rate and query efficiency. Moreover, TAI3 generalizes well to stronger target models using smaller LLMs for test generation, and adapts to evolving APIs across domains. [...] We evaluate TAI3 on 3 representative categories of LLMs as target models [...] RQ1: How effective is the proposed stress testing framework in uncovering agents errors? [...] Table 1 shows that, cross all domains and input categories, our method consistently outperforms the SelfRef baseline in terms of EESR. [...] We conduct an ablation study to isolate the contributions of the predictive model and retrieval strategies. |
| Researcher Affiliation | Academia | Shiwei Feng, Xiangzhe Xu, Xuan Chen, Kaiyuan Zhang, Syed Yusuf Ahmed, Zian Su, Mingwei Zheng, Xiangyu Zhang Department of Computer Science, Purdue University EMAIL |
| Pseudocode | No | The paper describes the design of TAI3 through textual descriptions and a pipeline diagram (Figure 3), but it does not contain any explicit pseudocode or algorithm blocks. |
| Open Source Code | No | Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [NA] Justification: We will release our code upon acceptance. |
| Open Datasets | Yes | We construct a dataset consisting of 80 toolkit APIs and 233 parameters across five domains: finance, healthcare, smart home, logistics, and office. The data are adopted from ToolEmu [12]. |
| Dataset Splits | No | The paper describes the construction of a dataset of toolkit APIs and parameters, and mentions evaluating on benchmarks like Agent-SafetyBench and ToolEmu. However, it does not specify any training, validation, or test splits for models or data used in TAI3's own experimental runs. It primarily focuses on the generation and evaluation of test cases within semantic partitions, not on traditional dataset splits for machine learning model training. |
| Hardware Specification | No | The paper states in the NeurIPS Paper Checklist, Question 8, that it provides sufficient details of compute resources in the Appendix. However, a review of the Appendix does not reveal any specific hardware details such as GPU models, CPU types, or memory specifications used for the experiments. It only mentions general computational resources provided by the Center for AI Safety in the Acknowledgement section. |
| Software Dependencies | No | The paper mentions various LLMs used as target models or within TAI3 (e.g., Llama-3.1-8B, GPT-4o-mini, Qwen-30B-A3B, phi4-mini). However, it does not specify any ancillary software dependencies (e.g., Python version, specific libraries like PyTorch or TensorFlow, along with their version numbers) that would be needed to replicate the experiments. |
| Experiment Setup | Yes | Our default testing LLM (behind TAI3) is GPT-4o-mini. [...] The query budget to the target agent is set to 5, consistent with our Stage 2 sampling process. [...] TAI3 uses a small language model (SLM) to approximate the error likelihood of each mutated task. [...] The top N = 3 strategies are selected to guide the next mutation. [...] In this sensitivity analysis, we vary the number of retrieved strategies used to guide mutation. As shown in Figure 16, using the top 3 retrieved strategies consistently achieves the best EESR performance across all tested models. [...] We choose 3 in our default experiment setting. |
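
Since the code is not yet released, the reported defaults in the Experiment Setup row can only be illustrated, not reproduced. The following is a minimal sketch, not the authors' implementation: it records the three stated defaults (testing LLM, query budget of 5, top N = 3 strategies) and shows one plausible way to pick the N highest-scoring strategies. The class name, function name, strategy names, and scores are all hypothetical; how TAI3 actually scores and retrieves strategies is not specified in the excerpt above.

```python
import heapq
from dataclasses import dataclass


@dataclass(frozen=True)
class DefaultSetup:
    """Defaults quoted in the Experiment Setup row (field names are hypothetical)."""
    testing_llm: str = "GPT-4o-mini"  # default testing LLM behind TAI3
    query_budget: int = 5             # queries allowed to the target agent
    top_n_strategies: int = 3         # strategies used to guide the next mutation


def select_top_strategies(scores: dict[str, float], n: int) -> list[str]:
    """Return the names of the n highest-scoring strategies, best first."""
    best = heapq.nlargest(n, scores.items(), key=lambda item: item[1])
    return [name for name, _ in best]


setup = DefaultSetup()

# Hypothetical strategy names and relevance scores, for illustration only.
example_scores = {
    "paraphrase": 0.42,
    "add_distractor": 0.77,
    "unit_change": 0.61,
    "negation": 0.15,
}

selected = select_top_strategies(example_scores, setup.top_n_strategies)
print(selected)
```

Keeping the defaults in one immutable object makes a sensitivity analysis like the one described (varying the number of retrieved strategies) a matter of constructing `DefaultSetup(top_n_strategies=k)` for each `k`.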