HYSYNTH: Context-Free LLM Approximation for Guiding Program Synthesis

Authors: Shraddha Barke, Emmanuel Anaya Gonzalez, Saketh Ram Kasibatla, Taylor Berg-Kirkpatrick, Nadia Polikarpova

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate this hybrid approach on three domains and show that it outperforms both unguided search and direct sampling from LLMs, as well as existing program synthesizers. Our evaluation shows that HYSYNTH outperforms both unguided search and LLMs alone, solving 58% of the tasks overall, compared to 40% for unguided search and 6% for LLMs without search. We evaluate HYSYNTH on 299 PBE tasks from three different domains: ARC (160 tasks), STRING (70 tasks), and TENSOR (69 tasks).
Researcher Affiliation | Academia | Shraddha Barke, UC San Diego, San Diego, USA, sbarke@ucsd.edu; Emmanuel Anaya Gonzalez, UC San Diego, San Diego, USA, fanayagonzalez@ucsd.edu; Saketh Ram Kasibatla, UC San Diego, San Diego, USA, skasibatla@ucsd.edu; Taylor Berg-Kirkpatrick, UC San Diego, San Diego, USA, tbergkirkpatrick@ucsd.edu; Nadia Polikarpova, UC San Diego, San Diego, USA, npolikarpova@ucsd.edu
Pseudocode | Yes | Algorithm 1: Bottom-Up Search Algorithm; Algorithm 2: ARC Synthesis Algorithm; Algorithm 3: Transform Synthesis Algorithm
Open Source Code | Yes | The authors have included experimental results in the supplemental data and publicly released the ARC synthesizer on GitHub, complete with documentation.
Open Datasets | Yes | We evaluate HYSYNTH on 299 PBE tasks from three different domains: ARC (160 tasks), STRING (70 tasks), and TENSOR (69 tasks). ARC Benchmark: the 160 ARC tasks are taken from the testing set of ARGA [44]. TENSOR Benchmark: the 69 TENSOR tasks are taken from TFCODER... STRING Benchmark: the 70 STRING tasks are taken from the testing set of PROBE, which is derived from the SYGUS benchmark [4].
Dataset Splits | No | The paper does not describe a separate validation split (with proportions or counts) for hyperparameter tuning or model selection in the context of the proposed HYSYNTH system.
Hardware Specification | No | The paper reports experiment timings (e.g., "The average time to sample 100 solutions from GPT4O is 4 seconds, 12 seconds and 20 seconds per task for the STRING, ARC and TENSOR domains, respectively.") but does not specify hardware such as CPU or GPU models, or memory.
Software Dependencies | No | The paper mentions using a "TensorFlow program" for the TENSOR domain and refers to the "Lark parser" in Appendix B, but it does not give exact version numbers for these or any other software dependencies crucial for reproducibility.
Experiment Setup | Yes | Our main HYSYNTH configuration uses GPT4O as the LLM, with 100 samples per task to learn a PCFG in non-strict mode (i.e., syntactically invalid completions are included in the PCFG learning process, as explained in Sec. 3.1). For ARC and TENSOR, we sample completions with temperature 1 and 4000 max tokens; for SYGUS, we use temperature 0.5 and 4000 max tokens. We use the same settings for all three LLMs. When prompting GPT4O, we set response_type to JSON. The timeout is set to 10 minutes for all experiments and includes the search time and the time to sample LLM completions (and compute the PCFG).
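As a concrete aid to reproduction, the per-domain sampling settings quoted in the Experiment Setup row can be sketched as request-building code. This is a minimal sketch, not the authors' implementation: the helper name `build_request` is hypothetical, and the parameter names follow the OpenAI chat-completions API (where the paper's "response_type to JSON" corresponds to `response_format={"type": "json_object"}`).

```python
# Per-domain sampling settings as reported in the paper:
# temperature 1 for ARC and TENSOR, 0.5 for the SYGUS-derived STRING domain,
# 4000 max tokens everywhere, 100 samples per task.
DOMAIN_SETTINGS = {
    "ARC":    {"temperature": 1.0, "max_tokens": 4000},
    "TENSOR": {"temperature": 1.0, "max_tokens": 4000},
    "STRING": {"temperature": 0.5, "max_tokens": 4000},
}

def build_request(domain: str, prompt: str, n: int = 100) -> dict:
    """Assemble chat-completion arguments for one PBE task.

    Hypothetical helper: returns the keyword arguments one would pass to an
    OpenAI-style chat-completions call when sampling n completions per task.
    """
    settings = DOMAIN_SETTINGS[domain]
    return {
        "model": "gpt-4o",
        "messages": [{"role": "user", "content": prompt}],
        "n": n,
        "temperature": settings["temperature"],
        "max_tokens": settings["max_tokens"],
        # JSON-constrained responses, per the paper's GPT4O prompting setup.
        "response_format": {"type": "json_object"},
    }
```

The returned dict would then be passed to the client (e.g., `client.chat.completions.create(**build_request("ARC", prompt))`), with the sampled completions fed into PCFG learning in non-strict mode.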