A large-scale benchmark for few-shot program induction and synthesis
Authors: Ferran Alet, Javier Lopez-Contreras, James Koppel, Maxwell Nye, Armando Solar-Lezama, Tomas Lozano-Perez, Leslie Kaelbling, Joshua Tenenbaum
ICML 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We analyze the effect of multiple design choices on transformer-based program induction and synthesis algorithms, pointing to shortcomings of current methods and suggesting multiple avenues for future work. A careful analysis of baselines (sec. 4.3) shows that there is both an initial promise and a long road ahead in the quest for building effective solutions to this problem. |
| Researcher Affiliation | Academia | Ferran Alet * 1 Javier Lopez-Contreras * 1 James Koppel 1 Maxwell Nye 1 Armando Solar-Lezama 1 Tomás Lozano-Pérez 1 Leslie Pack Kaelbling 1 Joshua B. Tenenbaum 1 1Massachusetts Institute of Technology, Cambridge Massachusetts, USA. |
| Pseudocode | No | The paper references 'pseudo-code' when discussing related work, but it does not provide any pseudocode or algorithm blocks for its own methods. |
| Open Source Code | No | The paper mentions: 'Note: since the camera-ready, we have made a final version of the dataset with more programs and increased diversity. The dataset description is the same, but metrics and statistics change, and get more detailed. You can find it in the updated PDF and materials at: https://lis.csail.mit.edu/progres.' While this URL likely contains code, the paper text itself does not unambiguously state that the *source code for the methodology* is available there. |
| Open Datasets | Yes | In this work, we present PROGRES (Programs from Real Executed Subproblems), a large meta-dataset of program induction tasks, enabling future methods in few-shot program induction and synthesis. You can find it in the updated PDF and materials at: https://lis.csail.mit.edu/progres. |
| Dataset Splits | Yes | Therefore, we choose to divide between meta-train, meta-validation and meta-test at the level of contests: training takes contests <1000, validation between 1000 and 1249 and testing more than 1250. Given a subprogram, we can generate the data for a single task, consisting of 20 input-output pairs (10 training, 10 test). (An illustrative split sketch appears below the table.) |
| Hardware Specification | No | To efficiently evaluate these programs we leveraged the MIT supercloud (Reuther et al., 2018), parallelizing program evaluations over 4800 CPU cores. While '4800 CPU cores' is mentioned, specific CPU models or other hardware details (like GPU models or memory) are not provided. |
| Software Dependencies | No | The paper mentions using 'Cling C++ interpreter (Vassilev et al.)' and relying on 'cppyy (Lavrijsen & Dutta, 2016)', as well as 'SPoC and Dr Repair (Kulal et al., 2019; Yasunaga & Liang, 2020)'. However, specific version numbers for these software components are not provided. |
| Experiment Setup | Yes | We used the BART (Lewis et al., 2019) pre-trained transformer model as a base, and fine-tuned it on our dataset to output programs. At evaluation time, we perform a beam search of beam size 10, and select the program out of those 10 candidates which performs the best on the support set and execute that program on the query set to produce final predictions for each query example. Performance consistently improves from 5 to 10 examples. Adding the text context improved performance by a surprisingly high amount. (A beam-search selection sketch appears below the table.) |
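
The Dataset Splits row quotes a contest-level meta-split and a 10/10 support-query structure per task. The following is a minimal sketch, not the authors' code: it assumes each task carries a Codeforces-style `contest_id` and 20 ordered input-output pairs, and the placement of contest 1250 exactly is an assumption, since the paper only says validation covers 1000-1249 and testing covers "more than 1250".

```python
# Illustrative sketch of the PROGRES meta-split and per-task split (not the authors' code).
# Assumptions: each task exposes a `contest_id` and 20 ordered input-output pairs;
# the boundary handling for contest 1250 exactly is an assumption.
from typing import Dict, List, Tuple

def assign_meta_split(contest_id: int) -> str:
    """Map a contest ID to meta-train / meta-validation / meta-test."""
    if contest_id < 1000:
        return "meta-train"
    if contest_id <= 1249:
        return "meta-validation"
    return "meta-test"

def split_task(io_pairs: List[Tuple[str, str]]) -> Dict[str, List[Tuple[str, str]]]:
    """Split a task's 20 input-output pairs into 10 support and 10 query examples."""
    assert len(io_pairs) == 20, "each task has 20 input-output pairs"
    return {"support": io_pairs[:10], "query": io_pairs[10:]}
```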
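
The Experiment Setup row describes generating 10 candidate programs with beam search from a fine-tuned BART model and keeping the candidate that does best on the support set. Below is a minimal sketch under stated assumptions: it uses the HuggingFace `transformers` BART API with an assumed checkpoint name, and `run_program` is a hypothetical stand-in for the paper's program execution pipeline (Cling via cppyy), not a library call.

```python
# Sketch of the quoted evaluation loop (assumptions marked): beam search with 10 beams,
# rank candidates by support-set accuracy, run the best candidate on the query set.
from transformers import BartForConditionalGeneration, BartTokenizer

# Assumed checkpoint name; the paper only says a pre-trained BART was fine-tuned.
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")
tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")

def run_program(program_src: str, inp: str) -> str:
    """Hypothetical helper: execute a candidate program on one input and return its output."""
    raise NotImplementedError

def solve_task(prompt: str, support: list, query: list) -> list:
    """Generate 10 candidate programs, pick the best on the support set, answer the queries."""
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True)
    beams = model.generate(**inputs, num_beams=10, num_return_sequences=10, max_length=512)
    candidates = tokenizer.batch_decode(beams, skip_special_tokens=True)

    def support_score(program: str) -> int:
        # Count the support pairs whose output the candidate reproduces exactly.
        return sum(run_program(program, x) == y for x, y in support)

    best = max(candidates, key=support_score)
    return [run_program(best, x) for x, _ in query]
```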