Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Once Upon an Input: Reasoning via Per-Instance Program Synthesis

Authors: Adam Stein, Neelay Velingker, Mayur Naik, Eric Wong

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Experiments across three frontier LLMs and 30 benchmarks including all tasks of Big Bench Extra Hard (BBEH), visual question answering tasks, relational reasoning tasks, and mathematical reasoning tasks show that PIPS improves the absolute harmonic mean accuracy by up to 8.6% and 9.4% compared to PoT and CoT respectively
Researcher Affiliation	Academia	Adam Stein Neelay Velingker Mayur Naik Eric Wong EMAIL University of Pennsylvania
Pseudocode	Yes	Algorithm 1 PIPS: Synthesis Loop
Open Source Code	Yes	Code for experiments and a demo is open-sourced at https://github.com/adaminsky/pips.
Open Datasets	Yes	We evaluate our approach using 23 tasks sourced from the Big Bench Extra Hard (BBEH) benchmark [7]. These tasks span topics such as geometric understanding, deductive logical reasoning, and commonsense understanding. Furthermore, we extend this study to the visual reasoning tasks CLEVR [22] and Leaf [24], the relational reasoning task CLUTRR [25], and four mathematical reasoning tasks of Omni Math [26].
Dataset Splits	Yes	For all datasets, we reserve a random sample of 20% of the data for calibration of our confidence switch, and we evaluate on the remaining 80% of the data.
Hardware Specification	Yes	Other experiments were run on a server with 96 Intel(R) Xeon(R) Gold 5318Y CPUs @ 2.10GHz with 1TB of system RAM. The server also had 10x NVIDIA A100 80GB GPUs which were only used for local testing of open-weights models.
Software Dependencies	No	The paper mentions several software components like Python, OpenCV, and Pillow, and uses LLMs like Gemini-2.0-Flash, GPT-4.1-mini, and o4-mini. However, it does not provide specific version numbers for any of the ancillary software libraries or programming languages used in the experimental setup.
Experiment Setup	Yes	For all models we used a temperature of 0.0. For PIPS, we used a maximum of 30 iterations for all models.
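
The Dataset Splits row describes reserving a random 20% of each dataset for calibration and evaluating on the remaining 80%. A minimal sketch of such a split follows. It is not taken from the paper's repository: the function name `split_calibration_eval`, the `seed` parameter, and the rounding rule are assumptions made for illustration only.

```python
import random


def split_calibration_eval(examples, calibration_fraction=0.2, seed=0):
    """Randomly reserve a fraction of examples for calibration.

    Returns (calibration, evaluation). The seed and the rounding rule
    are illustrative assumptions, not details reported in the paper.
    """
    rng = random.Random(seed)  # local RNG so the split is repeatable
    indices = list(range(len(examples)))
    rng.shuffle(indices)

    # Number of examples held out for calibration (hypothetical rounding).
    n_cal = round(calibration_fraction * len(examples))
    cal_idx = set(indices[:n_cal])

    # Preserve the original ordering within each part.
    calibration = [ex for i, ex in enumerate(examples) if i in cal_idx]
    evaluation = [ex for i, ex in enumerate(examples) if i not in cal_idx]
    return calibration, evaluation
```

Fixing the seed keeps the calibration and evaluation sets disjoint and identical across runs, which matters when the calibration set is used to tune a threshold that is then scored on the held-out portion.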