HYSYNTH: Context-Free LLM Approximation for Guiding Program Synthesis
Authors: Shraddha Barke, Emmanuel Anaya Gonzalez, Saketh Ram Kasibatla, Taylor Berg-Kirkpatrick, Nadia Polikarpova
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate this hybrid approach on three domains, and show that it outperforms both unguided search and direct sampling from LLMs, as well as existing program synthesizers. Our evaluation shows that HYSYNTH outperforms both unguided search and LLMs alone, solving 58% of the tasks overall, compared to 40% for unguided search and 6% for LLMs without search. We evaluate HYSYNTH on 299 PBE tasks from three different domains: ARC (160 tasks), STRING (70 tasks) and TENSOR (69 tasks). |
| Researcher Affiliation | Academia | Shraddha Barke UC San Diego San Diego, USA sbarke@ucsd.edu; Emmanuel Anaya Gonzalez UC San Diego San Diego, USA fanayagonzalez@ucsd.edu; Saketh Ram Kasibatla UC San Diego San Diego, USA skasibatla@ucsd.edu; Taylor Berg-Kirkpatrick UC San Diego San Diego, USA tbergkirkpatrick@ucsd.edu; Nadia Polikarpova UC San Diego San Diego, USA npolikarpova@ucsd.edu |
| Pseudocode | Yes | Algorithm 1 Bottom-Up Search Algorithm; Algorithm 2 ARC Synthesis Algorithm; Algorithm 3 Transform Synthesis Algorithm (a simplified sketch in the spirit of Algorithm 1 follows the table) |
| Open Source Code | Yes | The authors have included experimental results in the supplemental data and publicly released the ARC synthesizer on GitHub, complete with documentation. |
| Open Datasets | Yes | We evaluate HYSYNTH on 299 PBE tasks from three different domains: ARC (160 tasks), STRING (70 tasks) and TENSOR (69 tasks). ARC Benchmark: The 160 ARC tasks are taken from the testing set of ARGA [44]. TENSOR Benchmark: The 69 TENSOR tasks are taken from TFCODER... STRING Benchmark: The 70 STRING tasks are taken from the testing set of PROBE, which is derived from the SYGUS benchmark [4]. |
| Dataset Splits | No | The paper does not explicitly describe a separate validation dataset split with proportions or counts for hyperparameter tuning or model selection in the context of their proposed HYSYNTH system. |
| Hardware Specification | No | The paper discusses the time taken for experiments (e.g., "The average time to sample 100 solutions from GPT4O is 4 seconds, 12 seconds and 20 seconds per task for the STRING, ARC and TENSOR domains, respectively.") but does not specify any particular hardware components like CPU or GPU models, or memory. |
| Software Dependencies | No | The paper mentions using a "TensorFlow program" for the TENSOR domain and refers to the "Lark parser" in Appendix B, but it does not specify exact version numbers for these or any other software dependencies crucial for reproducibility. |
| Experiment Setup | Yes | Our main HYSYNTH configuration uses GPT4O as the LLM, with 100 samples per task to learn a PCFG in non-strict mode (i.e. syntactically invalid completions are included in the PCFG learning process, as explained in Sec. 3.1). For ARC, we sample completions with temperature 1 and 4000 max tokens. For TENSOR, we use temperature 1 and 4000 max tokens. For SYGUS, we use temperature 0.5 and 4000 max tokens. We use the same settings for all three LLMs. When prompting GPT4O, we set response_type to JSON. The timeout is set to 10 minutes for all experiments and includes the search time and time to sample LLM completions (and compute PCFG). (See the sampling and PCFG-learning sketch following this table.) |
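To make the Experiment Setup row concrete, the following is a minimal, hypothetical Python sketch of the described pipeline: sampling 100 completions per task from GPT-4o (temperature 1, 4000 max tokens, JSON responses) and turning rule-usage counts into PCFG costs. The toy grammar, the prompt handling, and the substring-based counting are illustrative simplifications and not the authors' implementation, which parses completions (e.g. with Lark) and has a dedicated non-strict mode for syntactically invalid samples.

```python
# Hypothetical sketch: sample LLM completions for a PBE task and derive
# per-rule costs for a toy DSL by counting how often each rule name occurs
# in the samples (a crude stand-in for the paper's parse-based rule counts).
import math
from collections import Counter

from openai import OpenAI  # assumes the openai v1.x Python client is installed

# Toy DSL grammar: nonterminal -> list of rule names (illustrative only).
GRAMMAR = {"S": ["concat", "replace", "substr", "const", "input"]}

def sample_completions(prompt: str, n: int = 100) -> list[str]:
    """Draw n candidate programs from GPT-4o with the paper's reported settings."""
    client = OpenAI()
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],  # prompt must request JSON output
        temperature=1.0,          # ARC/TENSOR setting; the paper uses 0.5 for STRING
        max_tokens=4000,
        n=n,
        response_format={"type": "json_object"},
    )
    return [choice.message.content or "" for choice in resp.choices]

def learn_pcfg(completions: list[str], smoothing: float = 1.0) -> dict[str, float]:
    """Return a cost (negative log-probability) per rule from surface counts.

    Non-strict in spirit: every completion contributes counts even if it would
    not parse, because we only look for rule names as substrings.
    """
    counts = Counter()
    for rules in GRAMMAR.values():
        for name in rules:
            counts[name] += sum(text.count(name) for text in completions)
    costs = {}
    for rules in GRAMMAR.values():
        total = sum(counts[name] + smoothing for name in rules)
        for name in rules:
            p = (counts[name] + smoothing) / total
            costs[name] = -math.log2(p)  # low cost = rule favored by the LLM
    return costs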
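```

Below is a similarly simplified sketch in the spirit of Algorithm 1 (the bottom-up search referenced in the Pseudocode row): programs are enumerated level by level in order of total rule cost, so operators the LLM used frequently (low cost) are combined first. The three-rule string DSL, the integer costs, and the absence of observational-equivalence pruning are assumptions made for illustration, not the authors' synthesizer.

```python
# Hypothetical sketch of cost-guided bottom-up enumeration: each rule's cost
# comes from the learned PCFG (e.g. a rounded -log2 probability); the first
# program consistent with all input-output examples is returned.
from itertools import product

RULE_COSTS = {"input": 1, "concat": 2, "space": 3}  # illustrative costs only

def evaluate(prog, x):
    """Evaluate a program AST (nested tuples) on one input string."""
    op, *args = prog
    if op == "input":
        return x
    if op == "space":
        return " "
    if op == "concat":
        return evaluate(args[0], x) + evaluate(args[1], x)

def bottom_up_search(examples, max_cost=20):
    """Return the cheapest program (by total PCFG cost) matching every example."""
    bank = {}  # cost -> programs of exactly that cost
    for cost in range(1, max_cost + 1):
        new = []
        # Leaf rules whose cost equals the current level.
        for op in ("input", "space"):
            if RULE_COSTS[op] == cost:
                new.append((op,))
        # concat(a, b) where rule cost + cost(a) + cost(b) == current level.
        for ca, cb in product(bank, repeat=2):
            if RULE_COSTS["concat"] + ca + cb == cost:
                for a, b in product(bank[ca], bank[cb]):
                    new.append(("concat", a, b))
        for prog in new:
            if all(evaluate(prog, x) == y for x, y in examples):
                return prog
        bank[cost] = new  # real implementations also prune equivalent programs
    return None

# Example: synthesize a program that duplicates its input ("ab" -> "abab").
print(bottom_up_search([("ab", "abab"), ("x", "xx")]))
```

Because the LLM-derived costs make frequently suggested operators cheap, this enumeration reaches LLM-preferred programs earlier than uniform-cost search while remaining complete up to the cost bound.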