Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Program Synthesis via Test-Time Transduction

Authors: Kang-il Lee, Jahyun Koo, Seunghyun Yoon, Minbeom Kim, Hyukhun Koh, Dongryeol Lee, Kyomin Jung

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We evaluate our approach on four benchmarks: Playgol, MBPP+, 1D-ARC, and programmatic world modeling on MiniGrid. We demonstrate that our method significantly improves program synthesis in both accuracy and efficiency.
Researcher Affiliation	Collaboration	1Dept. of ECE, Seoul National University 2IPAI, Seoul National University 3Adobe Research
Pseudocode	Yes	Algorithm 1: SYNTRA
Open Source Code	Yes	We release our code at https://github.com/klee972/SYNTRA.
Open Datasets	Yes	We evaluate our method on four program synthesis datasets: Playgol [9], an inductive programming benchmark for string transformation, MBPP+ [31], a benchmark for generating code from a natural language description, 1D-ARC [49], a visual reasoning benchmark, and programmatic world modeling on MiniGrid [7] environment.
Dataset Splits	Yes	In Playgol, the original task is to generate a program consistent with a set of given input-output examples. Each task in Playgol provides five input-output examples; to simulate realistic conditions involving epistemic uncertainty, we use only one example as a training example and treat the remaining four examples as test inputs. MBPP+ provides at least 52 input-output pairs for every task; we utilize one example as training data and between 5 and 50 examples as test cases. In this benchmark [1D-ARC], we use 1 example as the training set and 3 examples as the test set.
Hardware Specification	No	Since we primarily used APIs, there is no specific environment to report. The computational cost is discussed in detail in Section 6.
Software Dependencies	No	The paper names several LLMs used as models (gpt-4.1-2025-04-14, gpt-4o-mini-2024-07-18, Llama-3.1-8B-Instruct, Llama-3.1-70B-Instruct, gemma-3-27b-it, Claude Sonnet 4, and DeepSeek-V3-0324) and uses Python for code generation, but it does not specify version numbers for any libraries or frameworks.
Experiment Setup	Yes	We set the temperature of the program synthesis model to 1 and that of the transduction model to 0.7. Detailed prompts for both models are in Appendix B. Filtering is based on the 32 programs generated using AGA (c = 4, s = 8) with gpt-4o-mini-2024-07-18. In this experiment, we use 10 test inputs out of 50 for MBPP+.
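
The split described in the Dataset Splits row (one training example per task, with the remaining examples serving as test inputs) can be illustrated with a minimal Python sketch. This is not the authors' code: the helper name split_task, the N_TEST mapping, and the toy data are hypothetical, and choosing the first example as the training example is an assumption, since the quoted text does not say which example is used. The MBPP+ value of 10 is taken from the Experiment Setup row; the Dataset Splits row gives a range of 5 to 50.

```python
from typing import List, Sequence, Tuple

Example = Tuple[str, str]  # (input, expected output)


def split_task(
    examples: Sequence[Example], n_test: int
) -> Tuple[List[Example], List[str]]:
    """Keep one example for training and expose only the inputs of n_test others.

    Assumption: the first example is the training example. The quoted text
    does not specify which one is chosen.
    """
    if n_test < 1 or len(examples) < 1 + n_test:
        raise ValueError("need one training example plus n_test test examples")
    train = [examples[0]]
    # Test outputs are dropped: only the inputs are visible at test time.
    test_inputs = [inp for inp, _ in examples[1 : 1 + n_test]]
    return train, test_inputs


# Test-set sizes quoted in the table (hypothetical mapping for illustration).
N_TEST = {"Playgol": 4, "1D-ARC": 3, "MBPP+": 10}

# Toy Playgol-style task with five input-output pairs (made-up data).
toy_task = [("alice", "A"), ("bob", "B"), ("carol", "C"), ("dave", "D"), ("eve", "E")]
train, test_inputs = split_task(toy_task, N_TEST["Playgol"])
print(train)
print(test_inputs)
```

With five examples and four test inputs, the training set holds a single pair and the test set holds the four remaining inputs, matching the 1/4 split quoted for Playgol.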