Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

EnCompass: Enhancing Agent Programming with Search Over Program Execution Paths

Authors: Zhening Li, Armando Solar-Lezama, Yisong Yue, Stephan Zheng

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We present three case studies that demonstrate how the framework lets the programmer quickly improve the reliability of an agent and easily switch between different inference-time strategies, all with little additional coding. ... In Case Study 1 (Section 4.1), we implement a Java-to-Python code repository translation agent... We find that beam search outperforms simpler sampling strategies, thus demonstrating how one can use ENCOMPASS to discover better inference-time scaling laws. ... Figure 2: Results of using ENCOMPASS to apply different inference-time scaling methods to the code repository translation agent. All error bars show standard errors of the mean over 5 runs. (a) A comprehensive hyperparameter search for ps0; (b) For ps1 to ps4, we apply global best-of-N (GBoN), file-level local best-of-N (LBoN (c.)), and beam search at the file and method level (beam (c.) + beam (f.)) while controlling for cost.
Researcher Affiliation	Collaboration	Zhening Li (Asari AI, MIT CSAIL), Armando Solar-Lezama (Asari AI, MIT CSAIL), Yisong Yue (Asari AI, Caltech CMS), Stephan Zheng (Asari AI)
Pseudocode	Yes	Listing 4: branchpoint example: Best-of-N sampling ... Listing 6: Graph search example with branchpoint_choose ... Listing 19: Beam search in ENCOMPASS, 5 branchpoints excluding padding ... Listing 20: Beam search implemented in plain Python
Open Source Code	No	Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [No] Justification: Company code
Open Datasets	Yes	In Case Study 1 (Section 4.1), we implement a Java-to-Python code repository translation agent... We demonstrate these experiments on Java repositories from the MIT OCW Software Construction class. ... The repository contains solutions to the first homework (ps0) from the Spring 2016 version of the MIT Software Construction class available on MIT OpenCourseWare [32, 33]. ... We use a subset of the ARC-AGI benchmark corresponding to the 60 tasks sampled from the Public Training Set (Easy) that ADAS [39] used. ... LeetCode is a website with programming exercises... and the LeetCode Hard benchmark is a collection of 40 hard LeetCode problems [7].
Dataset Splits	No	We use a subset of the ARC-AGI benchmark corresponding to the 60 tasks sampled from the Public Training Set (Easy) that ADAS [39] used. We report the mean evaluation score as well as its standard error over 5 seeds.
Hardware Specification	Yes	Experiments for all case studies were conducted on a MacBook Pro with an M3 chip and 18 GB of RAM.
Software Dependencies	No	All LLM calls were made through the OpenAI API.
Experiment Setup	Yes	For all experiments, we set the LLM temperature to 0.0 for the base agent (no inference-time strategies), and 0.5 for the ENCOMPASS agent (with inference-time strategies). ... For beam search, we used a file-level beam width of 2 and a method-level beam width of 3, whereas we used N = 16 for both global and local best-of-N. ... The LLM temperature was set to 0.8 for all experiments.
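
The strategies named in the table (best-of-N sampling, beam search with a fixed beam width, and standard errors of the mean over several runs) can be illustrated with a minimal plain-Python sketch. This is not the ENCOMPASS API, and the paper's code is not public; the function names `best_of_n`, `beam_search`, and `mean_and_sem`, together with the `sample`, `expand`, and `score` callbacks, are hypothetical and chosen only for illustration.

```python
import math
import random
from typing import Callable, List, Sequence, Tuple


def best_of_n(sample: Callable[[random.Random], str],
              score: Callable[[str], float],
              n: int,
              rng: random.Random) -> str:
    """Draw n independent candidates and keep the highest-scoring one."""
    candidates = [sample(rng) for _ in range(n)]
    return max(candidates, key=score)


def beam_search(initial: str,
                expand: Callable[[str, random.Random], List[str]],
                score: Callable[[str], float],
                beam_width: int,
                steps: int,
                rng: random.Random) -> str:
    """Keep only the top `beam_width` partial results after each step."""
    beam = [initial]
    for _ in range(steps):
        # Expand every state in the beam, then prune to the best few.
        children = [c for state in beam for c in expand(state, rng)]
        if not children:
            break
        beam = sorted(children, key=score, reverse=True)[:beam_width]
    return max(beam, key=score)


def mean_and_sem(values: Sequence[float]) -> Tuple[float, float]:
    """Mean and standard error of the mean (sample variance, n - 1)."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)
    return mean, math.sqrt(variance / n)


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy stand-ins for LLM calls: candidates are bit strings,
    # and the (hypothetical) score is the number of "1" characters.
    toy_score = lambda s: s.count("1")
    toy_sample = lambda r: "".join(r.choice("01") for _ in range(4))
    toy_expand = lambda s, r: [s + "0", s + "1"]

    print(best_of_n(toy_sample, toy_score, n=16, rng=rng))
    print(beam_search("", toy_expand, toy_score, beam_width=2, steps=3, rng=rng))
    print(mean_and_sem([0.4, 0.5, 0.6, 0.5, 0.5]))
```

In a real agent, `sample` and `expand` would wrap LLM calls and `score` would be an evaluator such as a test-suite pass rate; the values N = 16 and beam widths 2 and 3 quoted in the table would be passed as `n` and `beam_width`.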