Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

ReCAP: Recursive Context-Aware Reasoning and Planning for Large Language Model Agents

Authors: Zhenyu Zhang, Tianyi Chen, Weiran Xu, Alex Pentland, Jiaxin Pei

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We evaluate ReCAP across embodied and knowledge-intensive tasks with different planning horizons and feedback dynamics: Robotouille [7], ALFWorld [21], FEVER [23], and SWE-bench [9]. ALFWorld features short, largely linear embodied sequences. In Robotouille, the horizon grows much longer, and subgoals must be interleaved or refined continuously under resource contention. FEVER remains a shallow, tool-mediated retrieval task with a small symbolic action API. SWE-bench expands the action space from finite to effectively unbounded: the agent must compose multi-step code edits in a space far larger than environments with a limited verb set.
Researcher Affiliation	Academia	Zhenyu Zhang (1), Tianyi Chen (1), Weiran Xu (1), Alex Pentland (2, 3), Jiaxin Pei (2). (1) Department of Computer Science, Stanford University; (2) Stanford Institute for Human-Centered AI; (3) MIT Media Lab.
Pseudocode	Yes	Algorithm 1 ReCAP. Require: LLM context C. Ensure: updated context C. 1: (T, S) ← π(C); 2: while S not empty do; 3: if S[0] is primitive then; 4: O ← E(S[0]); 5: C ← C ⊕ ⟨T, S, S[0], O⟩; 6: else; 7: C ← ReCAP(C ⊕ ⟨T, S, S[0]⟩); 8: end if; 9: C ← C ⊕ ⟨T, S[1:]⟩; 10: (T, S) ← ρ(C); 11: end while; 12: return C
Open Source Code	No	We will release anonymized code, prompt templates, environment setup instructions, and evaluation scripts in the supplemental material to enable faithful reproduction of all main results. Public access to the code and documentation will be provided upon publication.
Open Datasets	Yes	We evaluate ReCAP on four benchmarks, spanning coding, embodied, and text-based reasoning: ALFWorld [21], Robotouille [7], FEVER [23], and SWE-bench Verified [9].
Dataset Splits	Yes	We evaluate on the official unseen split with the provided symbolic interface. For few-shot construction, we pick one training task per each of the six categories (seen in prior work), adapt the narration to the agent's prompt format, and keep identical rule descriptions across agents. We evaluate 10 synchronous and 10 asynchronous recipes, each with 10 official instances. We evaluate 200 randomly sampled claims with a single shared demonstration from the training set adapted to each agent's prompt format.
Hardware Specification	No	All experiments were conducted via commercial API calls to proprietary and open-source language models (e.g., OpenAI [4], Llama [12], DeepSeek [3]). We report total token usage and corresponding cost estimates for each benchmark in Appendix. Since no local training or GPU inference was performed, compute resources are quantified in terms of API usage rather than hardware.
Software Dependencies	Yes	For Robotouille, ALFWorld, and FEVER, we use the same task from the training sets to construct the one-shot examples and apply the same step limitations across agents, and we use GPT-4o (2024-08-06) to conduct all the main experiments. For SWE-bench... The LLM used is GPT-4.1 (2025-04-14), with no demonstrations and a temperature set to zero. We modified the original SWE-agent code, which uses LiteLLM to call the OpenAI API.
Experiment Setup	Yes	All evaluations are conducted under a pass@1 setting: each agent is allowed a single reasoning-execution trajectory per task instance, without retries, beam search, or sample-level ensembling. Both methods share identical LLM settings (temperature = 0.5, default max tokens GPT-4o's 128K window and no additional sampling penalties) and the same max_step_multiplier=4 (up to four times the theoretical optimal step count per task) to allow room for error recovery. For SWE-bench... GPT-4.1 (2025-04-14), with no demonstrations and a temperature set to zero.
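
The Pseudocode row above quotes Algorithm 1 in flattened form. As a reading aid, the control flow can be sketched in Python roughly as follows. This is a minimal sketch, not the authors' implementation: the context is assumed to be a plain list of tuples, context updates are assumed to be appends (the operator did not survive extraction), and `toy_plan`, `toy_refine`, `toy_execute`, and `DECOMPOSITION` are hypothetical stand-ins for the LLM planner π, the refinement step ρ, and the environment E.

```python
from typing import Callable, List, Tuple

Context = List[tuple]            # assumption: context modeled as a list of records
Plan = Tuple[str, List[str]]     # (thought T, ordered subtask list S)


def recap(
    context: Context,
    plan: Callable[[Context], Plan],      # stands in for pi(C)
    refine: Callable[[Context], Plan],    # stands in for rho(C)
    execute: Callable[[str], str],        # stands in for E(S[0])
    is_primitive: Callable[[str], bool],
) -> Context:
    """Recursive loop following the line structure of Algorithm 1."""
    thought, subtasks = plan(context)                     # line 1
    while subtasks:                                       # line 2
        head = subtasks[0]
        if is_primitive(head):                            # line 3
            observation = execute(head)                   # line 4
            context = context + [(thought, subtasks, head, observation)]  # line 5
        else:
            # line 7: descend into the non-primitive subtask
            context = recap(context + [(thought, subtasks, head)],
                            plan, refine, execute, is_primitive)
        context = context + [(thought, subtasks[1:])]     # line 9
        thought, subtasks = refine(context)               # line 10
    return context                                        # line 12


# --- Toy stand-ins (hypothetical; no LLM or real environment involved) ---
DECOMPOSITION = {"make tea": ["boil water", "steep tea"]}
log: List[str] = []


def toy_plan(context: Context) -> Plan:
    if not context:                       # root call: fixed top-level plan
        return "root", ["make tea", "serve"]
    head = context[-1][2]                 # subtask we just descended into
    return head, list(DECOMPOSITION[head])


def toy_refine(context: Context) -> Plan:
    thought, remaining = context[-1]      # keep the remaining plan unchanged
    return thought, list(remaining)


def toy_execute(action: str) -> str:
    log.append(action)
    return f"done: {action}"


final_context = recap([], toy_plan, toy_refine, toy_execute,
                      lambda task: task not in DECOMPOSITION)
print(log)
```

In this toy run the non-primitive subtask "make tea" is expanded recursively before "serve" is executed, so the primitive actions are logged in depth-first order. A real agent would replace the toy callables with LLM calls and an environment interface.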