Programming With a Differentiable Forth Interpreter

Authors: Matko Bošnjak, Tim Rocktäschel, Jason Naradowsky, Sebastian Riedel

ICLR 2017 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We show empirically that our interpreter is able to effectively leverage different levels of prior program structure and learn complex transduction tasks such as sequence sorting or addition with substantially less data and better generalisation over problem sizes. In addition, we introduce neural program optimisations based on symbolic computation and parallel branching that lead to significant speed improvements.
Researcher Affiliation | Academia | Matko Bošnjak, Tim Rocktäschel, Jason Naradowsky & Sebastian Riedel, Department of Computer Science, University College London, London, UK, {m.bosnjak, t.rocktaschel, j.narad, s.riedel}@cs.ucl.ac.uk
Pseudocode | No | The paper includes examples of Forth code (Listings 1 and 2) and mathematical descriptions of differentiable Forth words (Table 4), but does not provide structured pseudocode or algorithm blocks for its overall methodology.
Open Source Code | No | The paper does not provide any concrete access information (specific repository link, explicit code release statement, or code in supplementary materials) for the methodology described.
Open Datasets | No | We test ∂4 on the sorting and addition tasks presented in Reed & de Freitas (2015) with varying levels of program structure. This quote cites the source of the *tasks*, but the paper does not explicitly provide access information for the *datasets* used in its experiments; no specific dataset names (e.g., MNIST, CIFAR) are mentioned with links or formal citations.
Dataset Splits | No | The paper states 'Hyperparameters were tuned via random search on a development variant of each task', implying a validation set, but does not provide specific details on dataset splits (e.g., percentages or sample counts) for training, validation, or testing.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., exact GPU/CPU models, memory amounts, or detailed computer specifications) used for running its experiments.
Software Dependencies | No | The paper mentions the use of the Adam optimizer, but does not provide specific version numbers for any software dependencies, libraries, or programming languages used.
Experiment Setup | Yes | The parameters of each sketch are trained using Adam (Kingma & Ba, 2014), with gradient clipping and gradient noise (Neelakantan et al., 2015b). Hyperparameters were tuned via random search on a development variant of each task, for 1000 epochs, repeating each experiment 5 times. During testing we employ memory element discretisation, replacing differentiable stacks and pointers with their discrete counterparts, and effectively allowing the trained model to generalize to any sequence length if the correct sketch behavior has been learned. To illustrate the generalization ability of this architecture, we compare against a Seq2Seq (Sutskever et al., 2014) baseline. All Seq2Seq models are single-layer, with a hidden size of 50, trained similarly for 1000 epochs using Adam.
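
For readers attempting a re-implementation, the training setup quoted above (Adam with gradient clipping and gradient noise) can be approximated as follows. This is a minimal sketch in PyTorch, assuming a generic model, loss function, and data iterator; the paper does not name its framework, and the learning rate, clipping threshold, and gradient-noise schedule constants below are illustrative placeholders rather than the authors' reported values.

    import torch

    def train(model, loss_fn, data, epochs=1000, lr=1e-3,
              clip_norm=1.0, noise_eta=0.01, noise_gamma=0.55):
        # Adam optimiser, as stated in the paper (Kingma & Ba, 2014).
        optimizer = torch.optim.Adam(model.parameters(), lr=lr)
        step = 0
        for epoch in range(epochs):
            for x, y in data:
                optimizer.zero_grad()
                loss = loss_fn(model(x), y)
                loss.backward()
                # Gradient clipping by global norm (threshold is a placeholder).
                torch.nn.utils.clip_grad_norm_(model.parameters(), clip_norm)
                # Annealed Gaussian gradient noise (Neelakantan et al., 2015b):
                # sigma_t^2 = eta / (1 + t)^gamma, added to each gradient.
                sigma = (noise_eta / (1 + step) ** noise_gamma) ** 0.5
                for p in model.parameters():
                    if p.grad is not None:
                        p.grad.add_(torch.randn_like(p.grad) * sigma)
                optimizer.step()
                step += 1
        return model

The random hyperparameter search over a development variant of each task, and the 5 repetitions per experiment, would wrap an outer loop around this function; the test-time discretisation of stacks and pointers is specific to the paper's architecture and is not shown here.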