Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Learning to Execute Programs with Instruction Pointer Attention Graph Neural Networks

Authors: David Bieber, Charles Sutton, Hugo Larochelle, Daniel Tarlow

NeurIPS 2020 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To test the models, we propose evaluating systematic generalization on learning to execute using control flow graphs, which tests sequential reasoning and use of program structure. More practically, we evaluate these models on the task of learning to execute partial programs, as might arise if using the model as a heuristic function in program synthesis. Results show that the IPA-GNN outperforms a variety of RNN and GNN baselines on both tasks.
Researcher Affiliation | Industry | David Bieber (Google), Charles Sutton (Google), Hugo Larochelle (Google), Daniel Tarlow (Google)
Pseudocode | No | No pseudocode or algorithm blocks were found.
Open Source Code | No | No statement providing concrete access to source code for the methodology described in this paper was found.
Open Datasets | No | The paper describes generating its own dataset: 'We draw our dataset from a probabilistic grammar over programs using a subset of the Python programming language.' It does not provide access information (link, DOI, formal citation) for a publicly available or open dataset.
Dataset Splits | Yes | We draw our dataset from a probabilistic grammar over programs using a subset of the Python programming language... From this grammar we sample 5M examples with complexity c(x) ≤ C to comprise Dtrain. For this filtering, we use program length as our complexity measure c, with complexity threshold C = 10. We then sample 4.5k additional samples with c(x) > C to comprise Dtest, filtering to achieve 500 samples each at complexities {20, 30, ..., 100}. For each model class, we select the best model parameters using accuracy on a withheld set of examples from the training split, each with complexity precisely C.
Hardware Specification | No | No specific hardware details (e.g., GPU/CPU models, memory amounts) used for running experiments were mentioned in the paper.
Software Dependencies | No | No specific ancillary software details with version numbers (e.g., library or solver names with version numbers like Python 3.8, CPLEX 12.4) were found.
Experiment Setup | Yes | We train the models for three epochs using the Adam optimizer [17] and a standard cross-entropy loss using a dense output layer and a batch size of 32. We perform a sweep, varying the hidden dimension H ∈ {200, 300} and learning rate l ∈ {0.003, 0.001, 0.0003, 0.0001} of the model and training procedure.
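The split procedure quoted in the Dataset Splits row can be sketched as follows. This is a minimal sketch, not the paper's implementation: `sample_program` is a hypothetical stand-in for the paper's probabilistic grammar over a Python subset, and only the complexity measure (program length) and the filtering thresholds come from the quoted text.

```python
import random

def complexity(program: str) -> int:
    # The paper's complexity measure c(x) is program length; here we
    # approximate it as the number of lines.
    return len(program.splitlines())

def sample_program(rng: random.Random) -> str:
    # Hypothetical stand-in for sampling from the paper's probabilistic
    # grammar; real samples would be straight-line/control-flow programs.
    n_lines = rng.randint(1, 100)
    return "\n".join(f"v{i} = {rng.randint(0, 9)}" for i in range(n_lines))

def build_splits(rng: random.Random, n_train: int = 5_000_000,
                 threshold: int = 10):
    # Train split: programs with complexity c(x) <= C (C = 10 in the paper).
    train = []
    while len(train) < n_train:
        p = sample_program(rng)
        if complexity(p) <= threshold:
            train.append(p)
    # Test split: 500 programs at each complexity in {20, 30, ..., 100},
    # all with c(x) > C (4,500 programs total).
    buckets = {c: [] for c in range(20, 101, 10)}
    while any(len(v) < 500 for v in buckets.values()):
        p = sample_program(rng)
        c = complexity(p)
        if c in buckets and len(buckets[c]) < 500:
            buckets[c].append(p)
    test = [p for bucket in buckets.values() for p in bucket]
    return train, test
```

Rejection sampling per complexity bucket mirrors the paper's "filtering to achieve 500 samples each" phrasing; in practice one would sample in batches rather than one program at a time.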
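The hyperparameter sweep described in the Experiment Setup row amounts to a small grid search over hidden dimension and learning rate, with the other settings (3 epochs, batch size 32) held fixed. A minimal sketch, assuming a caller-supplied `train_fn` that trains one model and returns validation accuracy (the training routine itself is hypothetical; only the grid values and fixed settings come from the paper):

```python
import itertools

# Grid from the paper: H in {200, 300}, l in {0.003, 0.001, 0.0003, 0.0001}.
HIDDEN_DIMS = [200, 300]
LEARNING_RATES = [0.003, 0.001, 0.0003, 0.0001]

def sweep(train_fn):
    """Train one model per (H, l) combination and return
    (best_val_acc, best_hidden_dim, best_learning_rate), mimicking the
    paper's model selection on withheld complexity-C examples."""
    results = []
    for hidden_dim, lr in itertools.product(HIDDEN_DIMS, LEARNING_RATES):
        val_acc = train_fn(hidden_dim=hidden_dim, learning_rate=lr,
                           epochs=3, batch_size=32)  # fixed per the paper
        results.append((val_acc, hidden_dim, lr))
    return max(results)  # ties broken arbitrarily by tuple order
```

The grid has 2 × 4 = 8 configurations per model class, so the full sweep trains 8 models and keeps the one with the highest withheld-set accuracy.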