Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Can Dependencies Induced by LLM-Agent Workflows Be Trusted?
Authors: Yu Yao, Yiliao (Lia) Song, Yian Xie, Mengdan Fan, Mingyu Guo, Tongliang Liu
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct thorough experiments to evaluate SEQCV by performing extensive benchmark assessments on six standard datasets, comparisons with state-of-the-art (SOTA) reasoning models, efficiency comparison analyses, and investigations using cross-validation modules and workflow decomposition modules. |
| Researcher Affiliation | Academia | Sydney AI Centre, The University of Sydney; Adelaide University; Monash University; Peking University |
| Pseudocode | Yes | Algorithm 1 Segment-level Generation and Validation |
| Open Source Code | Yes | Code is available at github.com/tmllab/2025_NeurIPS_SeqCV. |
| Open Datasets | Yes | Mathematical reasoning: MATH [58] (617 level-5 problems from geometry, statistics, number theory, algebra, and calculus) and GSM8K [59]; Knowledge-intensive reasoning: MMLU-CF [60]; Logical reasoning: BBH [61]; Multi-hop reasoning: HotpotQA [62] and LongBench [63] (samples from MuSiQue [64] and 2WikiMultiHopQA [65] for long-context reasoning). |
| Dataset Splits | Yes | For each benchmark dataset, we randomly sample 300 examples for evaluation. |
| Hardware Specification | No | The paper does not require specific CPU or GPU resources. |
| Software Dependencies | No | The paper does not explicitly state software library versions used to replicate experiments. |
| Experiment Setup | Yes | We evaluate 8 tasks designed to test creativity, multi-step reasoning, and adherence to diverse constraints. The detailed task prompts can be found in Appendix E. Baselines: We compare the generated results against three recent baselines: Flow [38], AFlow [39] and Atom [41], as well as o4-mini-high [42] via the OpenAI interface. We implement SEQCV by using a mixture of o4-mini and o3-mini to complete each task. For each method, we run three times and select the best result. Appendix E (Prompts for SEQCV) also provides detailed prompts used. |
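
The table quotes two procedures that affect replication: a random draw of 300 evaluation examples per benchmark, and three runs per method with the best result kept. The sketch below shows one way these could be written in Python. It is not the authors' code. The function names, the fixed seed, and the treatment of pools smaller than 300 are assumptions made for illustration; the quoted text does not say whether a seed was fixed or how the best result was scored.

```python
import random
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def sample_eval_subset(examples: Sequence[T], k: int = 300, seed: int = 0) -> list[T]:
    """Draw k examples without replacement.

    The seed is an assumption: fixing it makes the draw repeatable,
    which the source does not state was done.
    """
    if len(examples) <= k:
        # Assumed behaviour: keep the whole pool when it is small.
        return list(examples)
    rng = random.Random(seed)  # local generator, no global state touched
    return rng.sample(list(examples), k)


def best_of_n(run: Callable[[int], float], n: int = 3) -> float:
    """Call run(i) for i in 0..n-1 and return the highest score.

    `run` is a hypothetical callable that executes one trial and
    returns a numeric score where higher is better.
    """
    return max(run(i) for i in range(n))
```

Note that reporting the best of several runs gives an optimistic estimate compared with the mean, so a replication would also need the per-run scores to compare like with like.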