Neural GPUs Learn Algorithms
Authors: Lukasz Kaiser, Ilya Sutskever
ICLR 2016
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we present experiments showing that a Neural GPU can successfully learn a number of algorithmic tasks and generalize well beyond the lengths that it was trained on. and Table 1: Neural GPU, stack RNN, and LSTM+A results on addition and multiplication. The table shows the fraction of test cases for which every single bit of the model's output is correct (see the exact-match sketch after the table). |
| Researcher Affiliation | Industry | Łukasz Kaiser & Ilya Sutskever Google Brain {lukaszkaiser,ilyasu}@google.com |
| Pseudocode | No | No structured pseudocode or algorithm blocks were found. The paper describes the architecture and operations using mathematical formulas and descriptive text. |
| Open Source Code | Yes | all less relevant details can be found in the code, which is released as open-source. The code is at https://github.com/tensorflow/models/tree/master/neural_gpu. |
| Open Datasets | No | The paper uses custom-generated random training data examples ('10k random training data examples for each training length') but does not provide concrete access information (link, DOI, repository) for a publicly available or open dataset (see the data-generation sketch after the table). |
| Dataset Splits | No | The paper describes the use of a 'curriculum progress threshold' which implies a validation process, but it does not specify exact percentages, sample counts, or explicit splits for validation data. |
| Hardware Specification | Yes | The joint forward-backward step time for this network was about 0.6s on an NVIDIA GTX 970 GPU. |
| Software Dependencies | No | The paper mentions using the 'Adam optimizer' but does not specify any software dependencies with version numbers (e.g., specific Python libraries like TensorFlow or PyTorch with their versions). |
| Experiment Setup | Yes | The number of layers was set to l = 2, the width of mental images was constant at w = 4, the number of maps in each mental image point was m = 24, and the convolution kernels' width and height were always kw = kh = 3. and For the results presented in this paper we used the Adam optimizer (Kingma & Ba, 2014) with ε = 10⁻⁴ and gradient norm clipped to 1. and We consider 3 settings of the learning rate, initial parameters scale, and 4 other hyperparameters discussed below: the relaxation pull factor, curriculum progress threshold, gradient noise scale, and dropout. (See the configuration sketch after the table.) |
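
The Research Type row cites the paper's Table 1 metric: an example counts as correct only if every output bit matches. The helper below is a minimal sketch of that exact-match fraction, not the authors' evaluation code; the function name and the list-of-bit-sequences encoding are assumptions.

```python
def exact_match_fraction(predictions, targets):
    """Fraction of test cases whose output is correct in every position.

    Mirrors the metric described for Table 1: a test case counts as
    correct only if all output bits match. `predictions` and `targets`
    are assumed to be equal-length lists of bit sequences.
    """
    correct = sum(1 for p, t in zip(predictions, targets) if p == t)
    return correct / len(targets)

# Example: two test cases, only the first fully correct -> 0.5
print(exact_match_fraction([[1, 0, 1], [0, 1, 1]], [[1, 0, 1], [0, 1, 0]]))
```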
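
The Open Datasets row notes that training data is generated randomly (10k random training examples per training length) rather than drawn from a published dataset. The sketch below illustrates one way such examples could be drawn for binary addition; the sampling scheme, function name, and integer encoding are assumptions, and the authors' actual generator lives in the released repository.

```python
import random

def random_addition_example(length, base=2):
    """Draw one random addition example with operands of the given length.

    Illustrative only: the paper reports generating 10k random training
    examples per training length, but the exact sampling and sequence
    encoding are defined in the released code, not here.
    """
    a = random.randrange(base ** length)
    b = random.randrange(base ** length)
    return a, b, a + b

# e.g. 10k examples with 20-bit operands
examples = [random_addition_example(20) for _ in range(10000)]
```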
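
The Experiment Setup row lists the reported hyperparameters. The snippet below restates them as a configuration sketch and shows an Adam optimizer with ε = 10⁻⁴ and gradient-norm clipping at 1 via the TensorFlow 2 Keras API; this API choice and the dictionary key names are assumptions, since the released code targets an earlier TensorFlow API.

```python
import tensorflow as tf  # assumption: TF 2.x; the released code uses an older API

# Reported hyperparameters restated as a plain dict (key names are hypothetical).
neural_gpu_config = {
    "num_layers": 2,          # l = 2 layers per step
    "mental_image_width": 4,  # w = 4
    "num_maps": 24,           # m = 24 maps per mental-image point
    "kernel_size": (3, 3),    # kw = kh = 3 convolution kernels
}

# Adam with epsilon = 1e-4 and gradient norm clipped to 1, as reported in the paper.
optimizer = tf.keras.optimizers.Adam(epsilon=1e-4, clipnorm=1.0)
```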