Gated Linear Networks

Authors: Joel Veness, Tor Lattimore, David Budden, Avishkar Bhoopchand, Christopher Mattern, Agnieszka Grabska-Barwinska, Eren Sezener, Jianan Wang, Peter Toth, Simon Schmitt, Marcus Hutter

AAAI 2021, pp. 10015-10023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We show that this architecture gives rise to universal learning capabilities in the limit, with effective model capacity increasing as a function of network size in a manner comparable with deep ReLU networks. Furthermore, we demonstrate that the GLN learning mechanism possesses extraordinary resilience to catastrophic forgetting, performing almost on par to an MLP with dropout and Elastic Weight Consolidation on standard benchmarks. ... We ran two sets of experiments: first, using the standard MNIST dataset with shuffled labels; and second, replacing the MNIST images with uniform noise of the same shape and dataset length. These results are presented in Figure 2 compared to an MLP baseline in an equivalent one-vs-all configuration.
Researcher Affiliation | Industry | Joel Veness, Tor Lattimore, David Budden, Avishkar Bhoopchand, Christopher Mattern, Agnieszka Grabska-Barwinska, Eren Sezener, Jianan Wang, Peter Toth, Simon Schmitt and Marcus Hutter, DeepMind. aixi@google.com, lattimore@google.com, budden@google.com, avishkar@google.com
Pseudocode | Yes | Algorithm 1 GLN(Θ, z, p, x, η, update). Perform a forward pass and optionally update weights. Each layer performs clipped geometric mixing over the outputs of the previous layer, where the mixing weights are side-info-dependent via the gating function (Line 10). Lines 12-13 apply (optionally) the weight update, which is obtained from Equation 2. (A minimal sketch of this forward/update procedure appears after the table.)
Open Source Code | No | The paper does not provide concrete access to source code, such as a specific repository link or an explicit code release statement.
Open Datasets | Yes | First we explore the use of GLNs for online (single-pass) classification of the deskewed MNIST dataset (LeCun et al. 1998). ... We next compare GLNs to a variety of general purpose batch learning techniques (SVMs, Gradient Boosting for Classification, MLPs) in small data regimes on a selection of standard UCI datasets. ... Our final result is to use GLNs and image-specific gating to construct an online image density model for the binarized MNIST dataset (Larochelle and Murray 2011), a standard benchmark for image density modeling.
Dataset Splits | Yes | A 1000-500 neuron GLN with context-dimension 8 was trained with a single pass over 80% of instances and evaluated with frozen weights on the remainder. ... Running our method online (i.e. a single pass of the concatenated training, validation and test sets) gave an average loss of 79.0 nats per image across the test data, and 80.74 nats per image if we held the parameters fixed upon reaching the test set. (The single-pass split protocol is sketched after the table.)
Hardware Specification | No | The paper does not provide specific hardware details (e.g., exact GPU/CPU models, processor types, or memory amounts) used for running its experiments.
Software Dependencies | No | The paper mentions using the 'Adam optimizer (Kingma and Ba 2014)' but does not provide specific version numbers for any software dependencies.
Experiment Setup | Yes | For GLNs, we select a fixed layer width of 128 and vary both the context dimension and number of layers. ... The GLN was trained with learning rate 10^-4 and the MLP using the Adam optimizer (Kingma and Ba 2014) with learning rate 10^-5, both selected by conducting a sweep over learning rates from 10^-1 to 10^-6. ... The learning rate at each step t was set to min{100/t, 0.01}. ... The comparison MLP used ReLU activations and the same number of weights, and was trained for 100 epochs using the Adam optimizer (Kingma and Ba 2014) with learning rate 0.001 and batch size 32. (The learning-rate schedule and sweep are sketched after the table.)
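
To make the Pseudocode row concrete, here is a minimal NumPy sketch of one GLN layer with halfspace gating and clipped geometric mixing, plus the online log-loss update the quoted description attributes to Lines 12-13 and Equation 2. It is a reconstruction from that description, not the authors' implementation (the report notes no code was released); the clipping constants, initialisation, and the GLNLayer/forward names are assumptions.

```python
import numpy as np

def logit(p):
    # Inverse sigmoid; inputs are clipped before calling so this stays finite.
    return np.log(p / (1.0 - p))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class GLNLayer:
    """One GLN layer: each neuron keeps 2**context_dim weight vectors, selects
    one via halfspace gating on the side information z, and applies clipped
    geometric mixing to the previous layer's probabilities."""

    def __init__(self, n_neurons, n_inputs, side_dim, context_dim,
                 eps=0.01, w_clip=200.0, seed=0):
        rng = np.random.default_rng(seed)
        # Random halfspace gating: context_dim hyperplanes per neuron.
        self.hyperplanes = rng.standard_normal((n_neurons, context_dim, side_dim))
        self.bias = rng.standard_normal((n_neurons, context_dim))
        # One mixing-weight vector per (neuron, context id), initialised uniform.
        self.w = np.full((n_neurons, 2 ** context_dim, n_inputs), 1.0 / n_inputs)
        self.eps, self.w_clip = eps, w_clip

    def _context_ids(self, z):
        # Which side of each hyperplane z falls on -> one integer context per neuron.
        bits = np.einsum('ncd,d->nc', self.hyperplanes, z) > self.bias
        return bits.astype(int) @ (1 << np.arange(bits.shape[1]))

    def forward(self, p, z, target=None, lr=None):
        c = self._context_ids(z)
        rows = np.arange(len(c))
        w = self.w[rows, c]                               # active weights, (n_neurons, n_inputs)
        lp = logit(np.clip(p, self.eps, 1.0 - self.eps))
        out = np.clip(sigmoid(w @ lp), self.eps, 1.0 - self.eps)
        if target is not None and lr is not None:
            # Online log-loss update for geometric mixing (Equation 2 style),
            # with the weights clipped back onto a hypercube.
            grad = (out - target)[:, None] * lp[None, :]
            self.w[rows, c] = np.clip(w - lr * grad, -self.w_clip, self.w_clip)
        return out

# Usage sketch: a small two-layer GLN for one binary (one-vs-all) output.
layers = [GLNLayer(128, n_inputs=784, side_dim=784, context_dim=4),
          GLNLayer(1, n_inputs=128, side_dim=784, context_dim=4)]
z = np.random.rand(784)            # side information (e.g. the flattened image)
p = np.clip(z, 0.05, 0.95)         # base predictions derived from the input
for layer in layers:
    p = layer.forward(p, z, target=1.0, lr=1e-4)
```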
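
The single-pass 80/20 protocol quoted in the Dataset Splits row amounts to the loop below. Only the split fraction, the single online pass, and the frozen-weight evaluation come from the quoted text; the predict/update interface and function name are hypothetical.

```python
import numpy as np

def single_pass_eval(model, X, y, train_frac=0.8, seed=0):
    """Update online on the first 80% of instances, then freeze the weights
    and score the remaining 20%. `model` is assumed to expose predict(x)
    and update(x, y); these method names are illustrative."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))
    n_train = int(train_frac * len(X))
    correct = 0
    for i, idx in enumerate(order):
        prediction = model.predict(X[idx])
        if i < n_train:
            model.update(X[idx], y[idx])         # single online pass, no epochs
        else:
            correct += int(prediction == y[idx])  # frozen-weight evaluation
    return correct / (len(X) - n_train)
```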
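
The Experiment Setup row quotes a step-size schedule of min{100/t, 0.01} and a learning-rate sweep from 10^-1 to 10^-6 around a fixed layer width of 128. A small sketch of that configuration follows; the specific depth and context-dimension values are illustrative, since the quoted text only says those were varied.

```python
def gln_learning_rate(t, max_lr=0.01, scale=100.0):
    """Step-size schedule quoted in the Experiment Setup row: min{100/t, 0.01}."""
    return min(scale / t, max_lr)

# Sweep implied by the quoted setup: fixed layer width 128, number of layers and
# context dimension varied, learning rates swept over 10^-1 ... 10^-6.
sweep = {
    "layer_width": [128],
    "num_layers": [1, 2, 4, 8],       # illustrative values, not from the paper
    "context_dim": [1, 2, 4, 8],      # illustrative values, not from the paper
    "learning_rate": [10.0 ** -k for k in range(1, 7)],
}

assert gln_learning_rate(1) == 0.01          # capped at 0.01 early in training
assert gln_learning_rate(100_000) == 0.001   # 100 / 100000 once the cap no longer binds
```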