Neural Programmer: Inducing Latent Programs with Gradient Descent

Authors: Arvind Neelakantan, Quoc Le, Ilya Sutskever

ICLR 2016 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | On a fairly complex synthetic table-comprehension dataset, traditional recurrent networks and attentional models perform poorly while Neural Programmer typically obtains nearly perfect accuracy.
Researcher Affiliation | Collaboration | Arvind Neelakantan (University of Massachusetts Amherst, arvind@cs.umass.edu); Quoc V. Le (Google Brain, qvl@google.com); Ilya Sutskever (Google Brain, ilyasu@google.com)
Pseudocode | Yes | Algorithm 1: High-level view of Neural Programmer during its inference stage for an input example. (An illustrative sketch of this inference step appears after the table.)
Open Source Code | No | The paper does not provide any link to, or statement about, an open-source code release.
Open Datasets | No | Our reason for using synthetic data is that it is easier to understand a new model with a synthetic dataset. We can generate the data in a large quantity, whereas the biggest real-world semantic parsing dataset we know of contains only about 14k training examples (Pasupat & Liang, 2015), which is very small by neural network standards.
Dataset Splits | No | The paper mentions a 'training set' and a 'test set' but does not describe a separate validation split or cross-validation for its experiments, nor exact percentages or sample counts for the splits.
Hardware Specification | No | The paper does not specify any hardware details (e.g., CPU or GPU models, or cloud computing resources) used for the experiments.
Software Dependencies | No | The paper mentions using the Adam optimizer but does not give version numbers for any software dependencies such as libraries, frameworks, or programming languages.
Experiment Setup | Yes | We use 4 time steps in our experiments (T = 4). Neural Programmer is trained with mini-batch stochastic gradient descent with the Adam optimizer (Kingma & Ba, 2014). The parameters are initialized uniformly at random within the range [-0.1, 0.1]. In all experiments, we set the mini-batch size to 50, the dimensionality d to 256, and the initial learning rate and momentum hyper-parameters of Adam to their default values (Kingma & Ba, 2014). We use a schedule inspired by Welling & Teh (2011), where at every step we sample a Gaussian with mean 0 and variance curr_step^(-0.55). To prevent exploding gradients, we perform gradient clipping by scaling the gradient when its norm exceeds a threshold (Graves, 2013). The threshold value is picked from {1, 5, 50}. We tune the ϵ hyper-parameter in Adam from {1e-6, 1e-8}, the Huber constant δ from {10, 25, 50}, and λ (the weight between the two losses) from {25, 50, 75, 100} using grid search. (A hedged sketch of this training recipe appears after the table.)
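For context on the Pseudocode row, the following is a minimal, illustrative sketch of the kind of soft selection step that Algorithm 1 describes. It is not the authors' implementation: the operation set (here just sum and count), the parameter shapes, the random initialization, and the simplified history update are all assumptions made for illustration. The real model also includes arithmetic, comparison, and logic operations, maintains a lookup answer over table entries, and is trained end to end.

```python
# Illustrative sketch of one soft selection step in a Neural Programmer-style model.
# All names and shapes here are hypothetical simplifications, not the paper's code.
import torch
import torch.nn.functional as F

torch.manual_seed(0)

d = 8                                   # hidden size (the paper uses d = 256)
T = 4                                   # number of time steps, as in the paper
num_cols, num_rows = 3, 5

table = torch.rand(num_rows, num_cols)  # toy numeric table
q = torch.rand(d)                       # stand-in for the question RNN encoding
h = torch.zeros(d)                      # history vector
op_emb = torch.rand(2, d)               # embeddings for ["sum", "count"]
col_emb = torch.rand(num_cols, d)       # embeddings for the table columns
W = torch.rand(d, 2 * d)                # selector parameters (random here, learned in practice)

scalar_answer = torch.tensor(0.0)
for t in range(T):
    z = torch.tanh(W @ torch.cat([q, h]))       # selector input: question + history
    alpha_op = F.softmax(op_emb @ z, dim=0)     # soft choice over operations
    alpha_col = F.softmax(col_emb @ z, dim=0)   # soft choice over columns

    col_sums = table.sum(dim=0)                                 # "sum" applied to every column
    col_counts = torch.full((num_cols,), float(num_rows))       # "count" applied to every column

    # Combine all (operation, column) results, weighted by both soft selections.
    per_col = alpha_op[0] * col_sums + alpha_op[1] * col_counts
    scalar_answer = (alpha_col * per_col).sum()

    # Update the history with the soft operation/column choices (simplified; the
    # paper uses an RNN over the weighted embeddings).
    h = torch.tanh(alpha_op @ op_emb + alpha_col @ col_emb)

print(scalar_answer)
```

In a trained model the question encoding q and history h would come from RNNs and the selector parameters would be learned; only the soft weighting mechanism over operations and columns is shown here.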
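The Experiment Setup row can also be read as a concrete training recipe. Below is a hedged sketch of that recipe in PyTorch, assuming a generic `model` and a caller-supplied `compute_loss` (both hypothetical placeholders); it is not the authors' code. Only the choices quoted above are taken from the paper: Adam with default learning rate and momentum, mini-batches of 50, uniform [-0.1, 0.1] initialization, gradient noise with variance curr_step^(-0.55), and gradient clipping by norm.

```python
# Hedged sketch of the optimization recipe described in the paper's experiment setup.
# `model`, `batches`, and `compute_loss` are hypothetical placeholders.
import math
import torch

def init_uniform(model, scale=0.1):
    # Initialize all parameters uniformly in [-scale, scale], as the paper reports.
    with torch.no_grad():
        for p in model.parameters():
            p.uniform_(-scale, scale)

def train(model, batches, compute_loss, clip_threshold=5.0, adam_eps=1e-6):
    # clip_threshold is picked from {1, 5, 50} and adam_eps from {1e-6, 1e-8} in the paper;
    # the specific defaults shown here are arbitrary choices from those ranges.
    optimizer = torch.optim.Adam(model.parameters(), eps=adam_eps)  # default lr and betas
    step = 0
    for batch in batches:                  # each batch would hold 50 examples
        step += 1
        optimizer.zero_grad()
        loss = compute_loss(model, batch)  # Huber loss (scalar answer) plus lookup loss,
        loss.backward()                    # weighted by lambda in the paper

        # Gradient noise: zero-mean Gaussian with variance step^(-0.55).
        std = math.sqrt(step ** -0.55)
        for p in model.parameters():
            if p.grad is not None:
                p.grad.add_(torch.randn_like(p.grad) * std)

        # Clip by global norm when it exceeds the threshold (Graves, 2013).
        torch.nn.utils.clip_grad_norm_(model.parameters(), clip_threshold)
        optimizer.step()
```

`init_uniform` would be applied once before training. The clipping threshold, Adam ϵ, Huber constant δ, and loss weight λ are reported only as grid-searched ranges, so the values above are sampled from those ranges rather than the authors' final selections.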