Neural GPUs Learn Algorithms

Authors: Łukasz Kaiser, Ilya Sutskever

ICLR 2016

Reproducibility assessment. Each entry below gives the reproducibility variable, the assessed result, and the LLM response quoting or summarizing the relevant passage of the paper.
Research Type: Experimental. LLM response: "In this section, we present experiments showing that a Neural GPU can successfully learn a number of algorithmic tasks and generalize well beyond the lengths that it was trained on." and "Table 1: Neural GPU, stack RNN, and LSTM+A results on addition and multiplication. The table shows the fraction of test cases for which every single bit of the model's output is correct."
Researcher Affiliation: Industry. LLM response: "Łukasz Kaiser & Ilya Sutskever, Google Brain, {lukaszkaiser,ilyasu}@google.com"
Pseudocode: No. LLM response: No structured pseudocode or algorithm blocks were found; the paper describes the architecture and operations using mathematical formulas and descriptive text.
Open Source Code: Yes. LLM response: "all less relevant details can be found in the code which is released as open-source." The code is at https://github.com/tensorflow/models/tree/master/neural_gpu.
Open Datasets: No. LLM response: The paper uses custom-generated random training data ("10k random training data examples for each training length") but does not provide concrete access information (link, DOI, repository) for a publicly available or open dataset. (A hedged data-generation sketch appears after the table.)
Dataset Splits: No. LLM response: The paper describes a "curriculum progress threshold", which implies a validation process, but it does not specify exact percentages, sample counts, or explicit splits for validation data. (A hedged curriculum-check sketch appears after the table.)
Hardware Specification: Yes. LLM response: "The joint forward-backward step time for this network was about 0.6s on an NVIDIA GTX 970 GPU."
Software Dependencies: No. LLM response: The paper mentions using the Adam optimizer but does not specify any software dependencies with version numbers (e.g., specific Python libraries like TensorFlow or PyTorch with their versions).
Experiment Setup: Yes. LLM response: "The number of layers was set to l = 2, the width of mental images was constant at w = 4, the number of maps in each mental image point was m = 24, and the convolution kernel's width and height was always kw = kh = 3." and "For the results presented in this paper we used the Adam optimizer (Kingma & Ba, 2014) with ε = 10⁻⁴ and gradients norm clipped to 1." and "We consider 3 settings of the learning rate, initial parameters scale, and 4 other hyperparameters discussed below: the relaxation pull factor, curriculum progress threshold, gradient noise scale, and dropout." (A hedged configuration sketch appears after the table.)
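
Because the training data is generated on the fly rather than downloaded, the Open Datasets entry can be illustrated with a short sketch. This is a minimal, hypothetical reconstruction of the kind of generator the paper describes ("10k random training data examples for each training length"); the exact digit encoding, padding, and task vocabulary are defined only in the released code, so the function names and format below are assumptions.

```python
import random

def random_addition_example(length, base=2):
    """One random addition example of a given input length (encoding is an assumption)."""
    a = random.randrange(base ** length)
    b = random.randrange(base ** length)
    digits = lambda x, n: [(x // base ** i) % base for i in range(n)]
    # Inputs are two `length`-digit numbers; the target gets one extra digit for a possible carry.
    return digits(a, length), digits(b, length), digits(a + b, length + 1)

def make_training_set(length, num_examples=10_000):
    """The paper reports 10k random training examples per training length."""
    return [random_addition_example(length) for _ in range(num_examples)]
```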
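
The Dataset Splits entry mentions a "curriculum progress threshold" without explicit split sizes. The sketch below shows one plausible way such a threshold could gate curriculum progression; the control flow and names are assumptions, not the authors' implementation (the released code is authoritative).

```python
def maybe_advance_curriculum(current_length, held_out_error, threshold, max_length):
    """Move to longer training examples once the error on held-out examples of the
    current length falls below the curriculum progress threshold.
    Assumed reconstruction, not the paper's exact rule."""
    if held_out_error < threshold and current_length < max_length:
        return current_length + 1
    return current_length
```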
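
The Experiment Setup entry lists concrete hyperparameters (l = 2, w = 4, m = 24, kw = kh = 3, Adam with ε = 10⁻⁴, gradient norm clipped to 1). Below is a hedged sketch of that configuration in modern TensorFlow; the released code predates this API, and the config key names and the train-step wrapper are mine, so treat it as an illustration of the reported settings rather than the authors' training loop.

```python
import tensorflow as tf

# Values come from the paper's text; the key names are illustrative.
CONFIG = {
    "num_layers": 2,        # l = 2
    "image_width": 4,       # w = 4, width of the "mental image"
    "num_maps": 24,         # m = 24 maps per mental-image point
    "kernel_size": (3, 3),  # kw = kh = 3
    "adam_epsilon": 1e-4,   # ε = 10⁻⁴
    "max_grad_norm": 1.0,   # gradient norm clipped to 1
}

optimizer = tf.keras.optimizers.Adam(epsilon=CONFIG["adam_epsilon"])

def train_step(model, loss_fn, inputs, targets):
    """One optimization step with global-norm gradient clipping."""
    with tf.GradientTape() as tape:
        loss = loss_fn(targets, model(inputs, training=True))
    grads = tape.gradient(loss, model.trainable_variables)
    grads, _ = tf.clip_by_global_norm(grads, CONFIG["max_grad_norm"])
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss
```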