Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Can Dependencies Induced by LLM-Agent Workflows Be Trusted?

Authors: Yu Yao, Yiliao (Lia) Song, Yian Xie, Mengdan Fan, Mingyu Guo, Tongliang Liu

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct thorough experiments to evaluate SEQCV by performing extensive benchmark assessments on six standard datasets, comparisons with state-of-the-art (SOTA) reasoning models, efficiency comparison analyses, and investigations using cross-validation modules and workflow decomposition modules.
Researcher Affiliation | Academia | Sydney AI Centre, The University of Sydney; Adelaide University; Monash University; Peking University
Pseudocode | Yes | Algorithm 1: Segment-level Generation and Validation
Open Source Code | Yes | Code is available at github.com/tmllab/2025_NeurIPS_SeqCV.
Open Datasets | Yes | Mathematical reasoning: MATH [58] (617 level-5 problems from geometry, statistics, number theory, algebra, and calculus) and GSM8K [59]; Knowledge-intensive reasoning: MMLU-CF [60]; Logical reasoning: BBH [61]; Multi-hop reasoning: HotpotQA [62] and LongBench [63] (samples from MuSiQue [64] and 2WikiMultiHopQA [65] for long-context reasoning).
Dataset Splits | Yes | For each benchmark dataset, we randomly sample 300 examples for evaluation.
Hardware Specification | No | The paper does not require specific CPU or GPU resources.
Software Dependencies | No | The paper does not explicitly state software library versions used to replicate experiments.
Experiment Setup | Yes | We evaluate 8 tasks designed to test creativity, multi-step reasoning, and adherence to diverse constraints. The detailed task prompts can be found in Appendix E. Baselines: We compare the generated results against three recent baselines: Flow [38], AFlow [39] and Atom [41], as well as o4-mini-high [42] via OpenAI interface. We implement SEQCV by using a mixture of o4-mini and o3-mini to complete each task. For each method, we run three times and select the best result. Appendix E (Prompts for SEQCV) also provides detailed prompts used.
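
The evaluation protocol quoted above (a random sample of 300 examples per benchmark, and three runs per method with the best result kept) can be sketched roughly as follows. This is an illustrative sketch only, not the authors' code: the function names `sample_eval_subset` and `best_of_n`, the fixed seed, and the assumption that a higher score is better are all hypothetical, since the summary reports neither a seed nor the selection criterion.

```python
import random
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def sample_eval_subset(dataset: Sequence[T], k: int = 300, seed: int = 0) -> List[T]:
    """Draw k examples without replacement.

    The seed is an assumption added for repeatability; the summary
    does not say which seed (if any) was used.
    """
    items = list(dataset)
    if len(items) <= k:
        # Fewer than k examples available: use all of them.
        return items
    rng = random.Random(seed)  # local generator, leaves global state untouched
    return rng.sample(items, k)


def best_of_n(run: Callable[[int], float], n: int = 3) -> float:
    """Call run(i) for i in 0..n-1 and keep the highest score.

    Assumes a higher score is better (hypothetical criterion).
    """
    return max(run(i) for i in range(n))


# Usage with placeholder data and placeholder scores.
examples = list(range(1000))
subset = sample_eval_subset(examples)
fake_scores = [0.2, 0.7, 0.5]
best = best_of_n(lambda i: fake_scores[i])
```

Fixing the seed makes the 300-example subset repeatable across runs, which matters here because the table marks the dataset splits as reported but gives no split files.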