Neural Programmer: Inducing Latent Programs with Gradient Descent
Authors: Arvind Neelakantan, Quoc Le, Ilya Sutskever
ICLR 2016
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | On a fairly complex synthetic table-comprehension dataset, traditional recurrent networks and attentional models perform poorly while Neural Programmer typically obtains nearly perfect accuracy. |
| Researcher Affiliation | Collaboration | Arvind Neelakantan University of Massachusetts Amherst arvind@cs.umass.edu Quoc V. Le Google Brain qvl@google.com Ilya Sutskever Google Brain ilyasu@google.com |
| Pseudocode | Yes | Algorithm 1 High-level view of Neural Programmer during its inference stage for an input example. |
| Open Source Code | No | The paper does not provide any specific link or statement about open-source code release. |
| Open Datasets | No | Our reason for using synthetic data is that it is easier to understand a new model with a synthetic dataset. We can generate the data in a large quantity, whereas the biggest real-world semantic parsing dataset we know of contains only about 14k training examples (Pasupat & Liang, 2015), which is very small by neural network standards. |
| Dataset Splits | No | The paper mentions a 'training set' and a 'test set' but does not describe a separate validation split or cross-validation for its own experiments, nor does it give exact percentages or sample counts for the splits. |
| Hardware Specification | No | The paper does not specify any hardware details (e.g., CPU, GPU models, or cloud computing resources) used for the experiments. |
| Software Dependencies | No | The paper mentions using 'Adam optimizer' but does not provide specific version numbers for any software dependencies like libraries, frameworks, or programming languages. |
| Experiment Setup | Yes | We use 4 time steps in our experiments (T = 4). Neural Programmer is trained with mini-batch stochastic gradient descent with the Adam optimizer (Kingma & Ba, 2014). The parameters are initialized uniformly randomly within the range [-0.1, 0.1]. In all experiments, we set the mini-batch size to 50, the dimensionality d to 256, and the initial learning rate and momentum hyper-parameters of Adam to their default values (Kingma & Ba, 2014). We add random noise to the gradient, using a schedule inspired from Welling & Teh (2011), where at every step we sample from a Gaussian with mean 0 and variance = curr_step^(-0.55). To prevent exploding gradients, we perform gradient clipping by scaling the gradient when the norm exceeds a threshold (Graves, 2013). The threshold value is picked from [1, 5, 50]. We tune the ϵ hyper-parameter in Adam from [1e-6, 1e-8], the Huber constant δ from [10, 25, 50], and λ (the weight between the two losses) from [25, 50, 75, 100] using grid search. |
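The Experiment Setup row is concrete enough to sketch in code. Below is a minimal, hypothetical PyTorch sketch (the paper releases no code and names no framework, so the framework choice and all function names are assumptions) of the quoted training hyper-parameters: uniform initialization in [-0.1, 0.1], Adam with default learning rate and momentum, a decaying gradient-noise schedule with variance = curr_step^(-0.55), and gradient clipping with a threshold drawn from [1, 5, 50]. The model and loss are placeholders; the paper's actual objective is a weighted combination of a Huber regression loss and a classification loss (weight λ), which is not reproduced here.

```python
# Minimal sketch of the reported training setup -- not the authors' implementation.
# Values marked "paper" come from the Experiment Setup row; the model and loss
# below are placeholders (assumptions) standing in for Neural Programmer itself.

import torch

d = 256                # hidden dimensionality (paper)
batch_size = 50        # mini-batch size (paper)
clip_threshold = 5.0   # one point from the grid [1, 5, 50] (paper)
adam_eps = 1e-6        # one point from the grid [1e-6, 1e-8] (paper)

# Placeholder model; the real model runs T = 4 time steps of operation/column selection.
model = torch.nn.Linear(d, d)

# Uniform random initialization in [-0.1, 0.1] (paper)
for p in model.parameters():
    torch.nn.init.uniform_(p, -0.1, 0.1)

# Adam with default learning rate and momentum hyper-parameters (paper)
optimizer = torch.optim.Adam(model.parameters(), eps=adam_eps)

def train_step(batch_x, batch_y, curr_step):
    optimizer.zero_grad()
    # Placeholder loss; the paper uses a Huber loss plus a classification loss weighted by λ.
    loss = ((model(batch_x) - batch_y) ** 2).mean()
    loss.backward()

    # Gradient noise schedule inspired by Welling & Teh (2011):
    # zero-mean Gaussian with variance = curr_step^(-0.55) (paper)
    std = (float(curr_step) ** -0.55) ** 0.5
    for p in model.parameters():
        if p.grad is not None:
            p.grad.add_(torch.randn_like(p.grad) * std)

    # Clip by global norm when it exceeds the threshold (paper)
    torch.nn.utils.clip_grad_norm_(model.parameters(), clip_threshold)

    optimizer.step()
    return loss.item()
```

The decaying variance means the injected gradient noise is large early in training, helping the model escape poor local optima, and shrinks as `curr_step` grows so that later updates follow the true gradient more closely.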