Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Once Upon an Input: Reasoning via Per-Instance Program Synthesis
Authors: Adam Stein, Neelay Velingker, Mayur Naik, Eric Wong
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments across three frontier LLMs and 30 benchmarks including all tasks of Big Bench Extra Hard (BBEH), visual question answering tasks, relational reasoning tasks, and mathematical reasoning tasks show that PIPS improves the absolute harmonic mean accuracy by up to 8.6% and 9.4% compared to PoT and CoT, respectively. |
| Researcher Affiliation | Academia | Adam Stein Neelay Velingker Mayur Naik Eric Wong EMAIL University of Pennsylvania |
| Pseudocode | Yes | Algorithm 1 PIPS: Synthesis Loop |
| Open Source Code | Yes | Code for experiments and a demo is open-sourced at https://github.com/adaminsky/pips. |
| Open Datasets | Yes | We evaluate our approach using 23 tasks sourced from the Big Bench Extra Hard (BBEH) benchmark [7]. These tasks span topics such as geometric understanding, deductive logical reasoning, and commonsense understanding. Furthermore, we extend this study to the visual reasoning tasks CLEVR [22] and Leaf [24], the relational reasoning task CLUTRR [25], and four mathematical reasoning tasks of OmniMath [26]. |
| Dataset Splits | Yes | For all datasets, we reserve a random sample of 20% of the data for calibration of our confidence switch, and we evaluate on the remaining 80% of the data. |
| Hardware Specification | Yes | Other experiments were run on a server with 96 Intel(R) Xeon(R) Gold 5318Y CPUs @ 2.10GHz with 1TB of system RAM. The server also had 10x NVIDIA A100 80GB GPUs which were only used for local testing of open-weights models. |
| Software Dependencies | No | The paper mentions several software components like Python, OpenCV, and Pillow, and uses LLMs like Gemini-2.0-Flash, GPT-4.1-mini, and o4-mini. However, it does not provide specific version numbers for any of the ancillary software libraries or programming languages used in the experimental setup. |
| Experiment Setup | Yes | For all models we used a temperature of 0.0. For PIPS, we used a maximum of 30 iterations for all models. |
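
The Dataset Splits and Research Type rows describe two mechanical steps: a random 20%/80% calibration/evaluation split and a harmonic mean over per-task accuracies. The sketch below illustrates both under stated assumptions. The function names, the fixed seed, and the rounding of the calibration size are hypothetical choices made for illustration; they are not taken from the paper or its repository.

```python
import random
from statistics import harmonic_mean


def split_calibration_eval(examples, calibration_fraction=0.2, seed=0):
    """Randomly reserve a fraction of examples for calibration.

    The remaining examples form the evaluation set. The seed and the
    rounding rule are assumptions; the paper only states the 20%/80% ratio.
    """
    rng = random.Random(seed)
    indices = list(range(len(examples)))
    rng.shuffle(indices)
    n_calibration = round(calibration_fraction * len(examples))
    calibration = [examples[i] for i in indices[:n_calibration]]
    evaluation = [examples[i] for i in indices[n_calibration:]]
    return calibration, evaluation


def harmonic_mean_accuracy(per_task_accuracy):
    """Aggregate per-task accuracies (each in [0, 1]) with the harmonic mean."""
    return harmonic_mean(per_task_accuracy)


if __name__ == "__main__":
    calibration, evaluation = split_calibration_eval(list(range(100)))
    print(len(calibration), len(evaluation))
    print(harmonic_mean_accuracy([0.5, 1.0]))
```

The harmonic mean is pulled down more strongly by weak tasks than the arithmetic mean, so a method cannot score well on this aggregate by excelling on only a few benchmarks.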