The Kanerva Machine: A Generative Distributed Memory

Authors: Yan Wu, Greg Wayne, Alex Graves, Timothy Lillicrap

ICLR 2018

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirically, we demonstrate that the adaptive memory significantly improves generative models trained on both the Omniglot and CIFAR datasets. Compared with the Differentiable Neural Computer (DNC) and its variants, our memory model has greater capacity and is significantly easier to train. ... 4 EXPERIMENTS Details of our model implementation are described in Appendix C. We use straightforward encoder and decoder models in order to focus on evaluating the improvements provided by an adaptive memory. In particular, we use the same model architecture for all experiments with both the Omniglot and CIFAR datasets, changing only the number of filters in the convolutional layers, the memory size, and the code size. We always use the on-line version of the update rule (section 3.3). The Adam optimiser was used for all training and required minimal tuning for our model (Kingma & Ba, 2014). In all experiments, we report the value of the variational lower bound (eq. 12), L, divided by the length of the episode, T, so the per-sample value can be compared with the likelihood from existing models. We first used the Omniglot dataset to test our model.
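
Illustrative note (not from the paper): the reported quantity is the episode bound L (eq. 12) divided by the episode length T, so it can be compared with per-image likelihood bounds from non-episodic models. A minimal PyTorch-style sketch of that bookkeeping, where model.episode_elbo is a hypothetical method standing in for the paper's bound:

import torch

def train_step(model, episode, optimiser):
    """One step on an episode of T images; returns the per-sample bound L / T."""
    T = episode.shape[0]
    L = model.episode_elbo(episode)   # hypothetical method returning the bound of eq. 12
    loss = -L                         # maximising L == minimising -L
    optimiser.zero_grad()
    loss.backward()
    optimiser.step()
    return (L / T).item()

# e.g. optimiser = torch.optim.Adam(model.parameters(), lr=1e-4), as reported in the paper
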
Researcher Affiliation | Industry | Yan Wu, Greg Wayne, Alex Graves, Timothy Lillicrap, DeepMind, {yanwu,gregwayne,gravesa,countzero}@google.com
Pseudocode | Yes | Algorithm 1 (Iterative Reading). Input: memory M, a (potentially noisy) query x_t, the number of iterations n. Output: an estimate of the noiseless x̂_t. ... Algorithm 2 (Writing). Input: images {x_t}_{t=1..T}, memory M with parameters R and U. Output: updated memory M.
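
Illustrative note (not from the paper): the two algorithms can be sketched under the linear-Gaussian memory model the paper describes, with the memory posterior kept as a mean R (K×C) and row covariance U (K×K). The encode, decode and address_weights callables below are placeholders for the paper's encoder, decoder and addressing network, and the Kalman-style update in write_episode is a sketch of online Bayesian linear regression rather than the authors' exact code:

import numpy as np

def iterative_read(x, R, address_weights, encode, decode, n_iter):
    """Algorithm 1 sketch: repeatedly read from memory to denoise a query x."""
    x_hat = x
    for _ in range(n_iter):
        w = address_weights(encode(x_hat))   # addressing weights, shape (K,)
        z = R.T @ w                          # posterior-mean read of the code, shape (C,)
        x_hat = decode(z)                    # decode back to image space
    return x_hat

def write_episode(X, R, U, address_weights, encode, obs_var=1.0):
    """Algorithm 2 sketch: online Bayesian (Kalman-style) update of (R, U) per image."""
    for x in X:
        z = encode(x)                        # code to be stored, shape (C,)
        w = address_weights(z)               # addressing weights, shape (K,)
        Uw = U @ w
        s = float(w @ Uw) + obs_var          # predictive variance of the read
        delta = z - R.T @ w                  # prediction error, shape (C,)
        gain = Uw / s                        # Kalman-like gain, shape (K,)
        R = R + np.outer(gain, delta)        # update memory mean
        U = U - np.outer(gain, Uw)           # update row covariance
    return R, U
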
Open Source Code | No | No explicit statement providing a link to, or announcing the release of, open-source code for the methodology was found.
Open Datasets | Yes | We first used the Omniglot dataset to test our model. This dataset contains images of hand-written characters with 1623 different classes and 20 examples in each class (Lake et al., 2015). ... We also tested our model with the CIFAR dataset, in which each 32×32×3 real-valued colour image contains much more information than a binary Omniglot pattern.
Dataset Splits | Yes | We first use the 28×28 binary Omniglot from Burda et al. (2015) and follow the same split of 24,345 training and 8,070 test examples.
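
Illustrative note (not from the paper): a minimal sketch of using that split, assuming the binarised images have already been saved as NumPy arrays (the file names are placeholders, not files released by the authors):

import numpy as np

# Binarised 28x28 Omniglot of Burda et al. (2015), split as quoted above.
train = np.load("omniglot_binarized_train.npy")   # expected shape (24345, 28, 28), values in {0, 1}
test = np.load("omniglot_binarized_test.npy")     # expected shape (8070, 28, 28), values in {0, 1}
assert train.shape[0] == 24345 and test.shape[0] == 8070

# One 32-image episode, sampled at random with class labels ignored (as in the paper).
episode = train[np.random.choice(train.shape[0], size=32, replace=False)]
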
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory) used for running its experiments.
Software Dependencies | No | The paper mentions the Adam optimiser (Kingma & Ba, 2014) and the Python Image Library but does not provide version numbers for these or for any other software dependencies.
Experiment Setup | Yes | For all experiments, we use a convolutional encoder to convert input images into 2C embedding vectors e(x_t), where C is the code size (dimension of z_t). The convolutional encoder has 3 consecutive blocks, where each block is a convolutional layer with a 4×4 filter and stride 2, which reduces the input dimension, followed by a basic ResNet block without bottleneck (He et al., 2016). All the convolutional layers have the same number of filters, which is either 16 or 32 depending on the dataset. ... We always use the on-line version of the update rule (section 3.3). The Adam optimiser was used for all training and required minimal tuning for our model (Kingma & Ba, 2014). In all experiments, we report the value of the variational lower bound (eq. 12), L, divided by the length of the episode, T, so the per-sample value can be compared with the likelihood from existing models. ... We use a 64×100 memory M, and a smaller 64×50 address matrix A. For simplicity, we always randomly sample 32 images from the entire training set to form an episode, and ignore the class labels. This represents a worst-case scenario, since the images in an episode will tend to have relatively little redundant information for compression. We use a mini-batch size of 16, and optimise the variational lower bound (eq. 12) using Adam with learning rate 1 × 10^-4. ... To accommodate the increased complexity of CIFAR, we use convolutional coders with 32 features at each layer, a code size of 200, and a 128×200 memory with a 128×50 address matrix. All other settings are identical to the experiments with Omniglot. ... We found that adding noise to the input of q_φ(y_t|x_t) helped stabilise training, possibly by restricting the information in the addresses. The exact magnitude of the added noise matters little, and we use Gaussian noise with zero mean and standard deviation of 0.2 for all experiments. We use a Bernoulli likelihood function for the Omniglot dataset, and a Gaussian likelihood function for CIFAR. To avoid the Gaussian likelihood collapsing, we added uniform noise U(0, 1/256) to the CIFAR images during training.
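
Illustrative note (not from the paper): a hedged PyTorch sketch of the encoder described above, i.e. three blocks of a 4×4 stride-2 convolution followed by a basic (non-bottleneck) residual block, ending in a linear layer that outputs the 2C embedding. Padding, activations and the residual-block details are assumptions; filter count and code size follow the quoted settings (16 filters / C = 100 for Omniglot, 32 filters / C = 200 for CIFAR):

import torch
import torch.nn as nn

class BasicResBlock(nn.Module):
    """Basic residual block without bottleneck (details assumed, not specified in the quote)."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x):
        h = self.conv2(self.relu(self.conv1(x)))
        return self.relu(x + h)

class ConvEncoder(nn.Module):
    """Three 4x4 stride-2 conv + ResNet-block stages, then a linear map to a 2C embedding."""
    def __init__(self, in_channels=1, filters=16, code_size=100, image_size=28):
        super().__init__()
        layers, c = [], in_channels
        for _ in range(3):                   # three downsampling blocks
            layers += [nn.Conv2d(c, filters, 4, stride=2, padding=1), nn.ReLU(),
                       BasicResBlock(filters)]
            c = filters
        self.conv = nn.Sequential(*layers)
        with torch.no_grad():                # infer the flattened feature size
            n_feat = self.conv(torch.zeros(1, in_channels, image_size, image_size)).numel()
        self.fc = nn.Linear(n_feat, 2 * code_size)

    def forward(self, x):
        h = self.conv(x).flatten(start_dim=1)
        return self.fc(h)                    # 2C embedding e(x_t)

# Usage: ConvEncoder(1, 16, 100, 28) for Omniglot; ConvEncoder(3, 32, 200, 32) for CIFAR.
# The quoted training settings (episodes of 32 images, mini-batch 16, Adam at 1e-4,
# input noise of std 0.2 on q(y_t|x_t), U(0, 1/256) noise on CIFAR pixels) belong to the
# surrounding training loop and are not repeated here.
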