Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

DyFlow: Dynamic Workflow Framework for Agentic Reasoning

Authors: Yanbo Wang, Zixiang Xu, Yue Huang, Xiangqi Wang, Zirui Song, Lang Gao, Chenxi Wang, Robert Tang, Yue Zhao, Arman Cohan, Xiangliang Zhang, Xiuying Chen

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We systematically evaluate DyFlow across diverse domains, including social reasoning, biomedical tasks, mathematical problem solving, and code generation. Results demonstrate that DyFlow significantly outperforms existing baselines, achieving substantial Pass@k improvements and exhibiting robust generalization across diverse domains.
Researcher Affiliation	Academia	(1) Mohamed bin Zayed University of Artificial Intelligence (MBZUAI); (2) University of Notre Dame; (3) Yale University; (4) University of Southern California
Pseudocode	Yes	Algorithm 1: DyFlow Framework for Complex Reasoning. Require: task p, templates O, designer π_θ, executor π_exec, budget T_max, summarizer f_summary. Ensure: final answer s_T and trajectory τ. 1: s_0 ← {p}; τ ← [ ]; M ← {} ▷ initial state, trace, and empty memory dictionary 2: for t = 0 to T_max − 1 do 3: z_t ← f_summary(s_t); G_t ∼ π_θ(· \| z_t) ▷ summarize context and sample stage subgraph 4: for each o = (O_k, ϕ, ψ) ∈ V_t (topological order) do 5: Retrieve inputs [M[k] \| k ∈ ψ] ▷ ψ is a list of keys, M is a dictionary 6: r ← π_exec(ϕ, [M[k] \| k ∈ ψ]) 7: Generate a unique key k_r for r ▷ e.g., by operator id or execution order 8: M[k_r] ← r ▷ store output in memory dictionary 9: end for 10: s_{t+1} ← UpdateState(s_t, G_t, M); τ.append((s_t, G_t)) ▷ update state and record step 11: if C^t_end satisfied or TERMINATE then 12: break ▷ stop if designer signals completion 13: end if 14: end for 15: return s_T, τ
Open Source Code	Yes	The code is publicly available at https://github.com/wyf23187/DyFlow.
Open Datasets	Yes	Datasets. We consider 5 diverse reasoning domains, each represented by a benchmark dataset: (1) Logical Reasoning using the LiveBench dataset [26], (2) Math Reasoning using the MATH benchmark [27], (3) Medical Reasoning with PubMedQA [4], (4) Code Reasoning via HumanEval [28], and (5) Social Reasoning using the SocialMaze [29] benchmark.
Dataset Splits	Yes	Table 7 summarizes the dataset statistics used in our experiments. We adopt a default TRAIN:TEST split ratio of approximately 1:3 across all datasets to balance supervision and evaluation coverage. For the MATH benchmark, we follow the setting in MaAS [36], selecting problems at difficulty level 5 across four representative categories: Combinatorics & Probability, Number Theory, Pre-algebra, and Pre-calculus. For HumanEval, when it is not included in the training set, we evaluate on the full 164 problems to ensure consistency with prior work.
Hardware Specification	Yes	We use Phi-4 [30] as the designer policy π_θ, trained on 2 Nvidia A6000 GPUs with LoRA-based parameter-efficient tuning [37].
Software Dependencies	No	The paper mentions using Phi-4 as the model and LoRA for tuning, but does not provide specific software versions for libraries like Python, PyTorch, or CUDA.
Experiment Setup	Yes	Training proceeds in two stages. First, supervised fine-tuning is performed on 1.5k design results from MATH, PubMedQA, and LiveBench, using a cutoff length of 2048, batch size 1 with gradient accumulation steps of 4, learning rate 5 × 10^-6, cosine learning rate scheduler with warmup ratio 0.1, bf16 precision, and 3 training epochs. The validation split is 10%. Second, we apply KTO [23] for preference-based refinement using 2k design results with a 1:1 positive-to-negative ratio, labeled based on task success. This stage uses a cutoff length of 4096, batch size 1 with gradient accumulation steps of 8, learning rate 2 × 10^-4, KL penalty β = 0.1, bf16 precision, cosine scheduler, and 3 epochs. Validation is performed every 500 steps. At inference time, both the designer and executor are run with temperature 0.01 to ensure deterministic outputs.
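
The control flow of Algorithm 1 in the Pseudocode row can be illustrated with a short Python sketch. This is not the authors' implementation: the names `Operator`, `StageGraph`, and `run_dyflow` are hypothetical, the designer, executor, and summarizer are stand-in callables rather than LLM calls, `UpdateState` is assumed to append the stage's outputs to the state, and the memory key is assumed to combine the operator id with the execution order.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Operator:
    """One node of a stage subgraph: (O_k, phi, psi) in Algorithm 1."""
    op_id: str             # operator template name (hypothetical field)
    instruction: str       # phi: instruction handed to the executor
    input_keys: List[str]  # psi: memory keys whose values are the inputs


@dataclass
class StageGraph:
    """Stage subgraph G_t; operators are assumed to be topologically sorted."""
    operators: List[Operator]
    terminate: bool = False  # designer's completion signal


def run_dyflow(
    task: str,
    designer: Callable[[str], StageGraph],
    executor: Callable[[str, List[str]], str],
    summarize: Callable[[List[str]], str],
    t_max: int,
) -> Tuple[List[str], List[Tuple[List[str], StageGraph]]]:
    state: List[str] = [task]      # s_0
    trace: List[Tuple[List[str], StageGraph]] = []
    memory: Dict[str, str] = {}    # M

    for _ in range(t_max):
        graph = designer(summarize(state))  # z_t, then G_t
        stage_outputs: List[str] = []
        for op in graph.operators:
            inputs = [memory[k] for k in op.input_keys]
            result = executor(op.instruction, inputs)
            # Assumed key scheme: operator id plus execution order.
            key = f"{op.op_id}#{len(memory)}"
            memory[key] = result
            stage_outputs.append(result)
        trace.append((state, graph))        # record (s_t, G_t)
        # Assumed UpdateState: append this stage's outputs to the state.
        state = state + stage_outputs
        if graph.terminate:
            break
    return state, trace


# Toy stand-ins for the LLM components, for demonstration only.
def toy_summarize(state: List[str]) -> str:
    return state[-1]


def toy_executor(instruction: str, inputs: List[str]) -> str:
    return f"{instruction}[{'; '.join(inputs)}]"


def toy_designer(summary: str) -> StageGraph:
    if not summary.startswith("solve"):
        return StageGraph([Operator("solve", "solve", [])])
    return StageGraph([Operator("check", "check", ["solve#0"])], terminate=True)


final_state, trace = run_dyflow(
    "2+2", toy_designer, toy_executor, toy_summarize, t_max=5
)
print(final_state)
print(len(trace))
```

With the toy components, the loop runs two stages: the first produces a "solve" output stored under `solve#0`, and the second reads that key, produces a "check" output, and signals termination before the budget of five stages is exhausted.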