A large-scale benchmark for few-shot program induction and synthesis

Authors: Ferran Alet, Javier Lopez-Contreras, James Koppel, Maxwell Nye, Armando Solar-Lezama, Tomas Lozano-Perez, Leslie Kaelbling, Joshua Tenenbaum

ICML 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We analyze the effect of multiple design choices on transformer-based program induction and synthesis algorithms, pointing to shortcomings of current methods and suggesting multiple avenues for future work. A careful analysis of baselines (sec. 4.3) shows that there is both an initial promise and a long road ahead in the quest for building effective solutions to this problem.
Researcher Affiliation | Academia | Ferran Alet*, Javier Lopez-Contreras*, James Koppel, Maxwell Nye, Armando Solar-Lezama, Tomás Lozano-Pérez, Leslie Pack Kaelbling, Joshua B. Tenenbaum (Massachusetts Institute of Technology, Cambridge, Massachusetts, USA).
Pseudocode | No | The paper references 'pseudo-code' when discussing related work, but it does not provide any pseudocode or algorithm blocks for its own methods.
Open Source Code | No | The paper mentions: 'Note: since the camera-ready, we have made a final version of the dataset with more programs and increased diversity. The dataset description is the same, but metrics and statistics change, and get more detailed. You can find the updated PDF and materials at: https://lis.csail.mit.edu/progres.' While this URL likely contains code, the paper text itself does not unambiguously state that the *source code for the methodology* is available there.
Open Datasets | Yes | In this work, we present PROGRES (Programs from Real Executed Subproblems), a large meta-dataset of program induction tasks, enabling future methods in few-shot program induction and synthesis. You can find the updated PDF and materials at: https://lis.csail.mit.edu/progres.
Dataset Splits | Yes | Therefore, we choose to divide between meta-train, meta-validation and meta-test at the level of contests: training takes contests <1000, validation between 1000 and 1249 and testing more than 1250. Given a subprogram, we can generate the data for a single task, consisting of 20 input-output pairs (10 training, 10 test). (This split rule is sketched in code below the table.)
Hardware Specification | No | To efficiently evaluate these programs we leveraged the MIT supercloud (Reuther et al., 2018), parallelizing program evaluations over 4800 CPU cores. While '4800 CPU cores' is mentioned, specific CPU models or other hardware details (like GPU models or memory) are not provided.
Software Dependencies | No | The paper mentions using the 'Cling C++ interpreter (Vassilev et al.)' and relying on 'cppyy (Lavrijsen & Dutta, 2016)', as well as 'SPoC and Dr Repair (Kulal et al., 2019; Yasunaga & Liang, 2020)'. However, specific version numbers for these software components are not provided. (A brief cppyy usage sketch appears below the table.)
Experiment Setup | Yes | We used the BART (Lewis et al., 2019) pre-trained transformer model as a base, and fine-tuned it on our dataset to output programs. At evaluation time, we perform a beam search of beam size 10, and select the program out of those 10 candidates which performs the best on the support set and execute that program on the query set to produce final predictions for each query example. Performance consistently improves from 5 to 10 examples. Adding the text context improved performance by a surprisingly high amount. (This evaluation recipe is sketched in code below the table.)
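
The contest-level split quoted in the Dataset Splits row is simple enough to sketch. The Python below is a minimal illustration under assumed field names (contest_id, pairs); it is not the authors' released data-loading code, and the boundary handling simply follows the quoted ranges.

```python
# Minimal sketch of the contest-level split quoted in the Dataset Splits row:
# meta-train uses contests below 1000, meta-validation contests 1000-1249,
# and meta-test the contests above that. Each task carries 20 input-output
# pairs, split 10/10 into support ("training") and query ("test") examples.
# Field names (contest_id, pairs) are assumptions for illustration.

def split_of(contest_id: int) -> str:
    if contest_id < 1000:
        return "meta-train"
    elif contest_id <= 1249:
        return "meta-validation"
    else:
        return "meta-test"

def make_task(contest_id: int, pairs: list) -> dict:
    assert len(pairs) == 20, "each task consists of 20 input-output pairs"
    return {
        "split": split_of(contest_id),
        "support": pairs[:10],  # 10 pairs shown to the model
        "query": pairs[10:],    # 10 held-out pairs used for evaluation
    }
```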
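
The Software Dependencies row points to cppyy and the Cling C++ interpreter for executing candidate programs, without version numbers. As a hedged illustration of that style of in-process C++ execution (not the paper's actual harness), cppyy can JIT-compile a snippet and call it from Python:

```python
# Illustration of the kind of in-process C++ execution that cppyy (backed by
# the Cling interpreter) provides; the function below is a made-up stand-in
# for a PROGRES subprogram, not taken from the dataset.
import cppyy

cppyy.cppdef("""
int add_one(int x) { return x + 1; }
""")

print(cppyy.gbl.add_one(41))  # prints 42
```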
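
The Experiment Setup row describes decoding 10 candidates with beam search, keeping the one that scores best on the support set, and executing it on the query set. The sketch below follows that recipe using the Hugging Face transformers API; the checkpoint name, the format_support() prompt builder, and the run_program() executor are assumptions rather than the authors' code.

```python
# Sketch of the evaluation recipe quoted in the Experiment Setup row: decode 10
# candidate programs with beam search, keep the one that does best on the
# support set, and run it on the query inputs to produce final predictions.
# Checkpoint name and the helper callables are assumptions, not the paper's code.
from transformers import BartForConditionalGeneration, BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")

def solve_task(task, format_support, run_program, beam_size=10):
    prompt = format_support(task["support"])  # serialize the 10 support pairs
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True)
    beams = model.generate(**inputs, num_beams=beam_size,
                           num_return_sequences=beam_size, max_length=512)
    candidates = [tokenizer.decode(b, skip_special_tokens=True) for b in beams]

    # Select the candidate that reproduces the most support outputs...
    def support_score(program):
        return sum(run_program(program, x) == y for x, y in task["support"])

    best = max(candidates, key=support_score)
    # ...and execute it on the query inputs to produce the final predictions.
    return [run_program(best, x) for x, _ in task["query"]]
```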