Measuring and regularizing networks in function space

Authors: Ari Benjamin, David Rolnick, Konrad Kording

ICLR 2019

For each reproducibility variable below, the result is given first, followed by the LLM response that supports it.
Research Type: Experimental
"Here, we show that it is simple and computationally feasible to calculate distances between functions in an L2 Hilbert space. We examine how typical networks behave in this space, and compare parameter ℓ2 distances with function L2 distances between various points of an optimization trajectory. We find that the two distances are nontrivially related. In the first setting we consider multitask learning, and the phenomenon of catastrophic forgetting that makes it difficult. In the second setting we propose a learning rule for supervised learning that constrains how much a network's function can change in any one update. We compared HCGD and SGD on feedforward and recurrent architectures. We tested HCGD as applied to the CIFAR-10 image classification problem. We next tested the performance of HCGD on a recurrent task. We trained an LSTM on the sequential MNIST task."
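A minimal sketch of how such a function-space distance could be estimated in PyTorch, assuming (as the quoted text suggests) that ||f_a − f_b||_{L2} is approximated by the root mean squared difference between the two networks' outputs on a finite batch of inputs; the helper name `function_l2_distance` and the batch source are illustrative, not from the paper:

```python
import torch

def function_l2_distance(model_a, model_b, inputs):
    """Estimate the L2 function-space distance between two networks.

    Approximates ||f_a - f_b||_{L2} by the root mean squared
    difference of the networks' outputs over a finite batch.
    """
    with torch.no_grad():
        out_a = model_a(inputs)
        out_b = model_b(inputs)
    # Mean over both batch and output dimensions, then square root.
    return torch.sqrt(torch.mean((out_a - out_b) ** 2))
```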
Researcher Affiliation: Academia
"Ari S. Benjamin¹, David Rolnick¹, and Konrad P. Kording¹; ¹University of Pennsylvania, Philadelphia, PA, 19142"
Pseudocode: Yes
"Algorithm 1: Hilbert-constrained gradient descent. Implements Equation 2. Algorithm 2: Hilbert-constrained gradient descent. This version of the algorithm includes momentum. Algorithm 3: Natural gradient by gradient descent."
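A minimal sketch of the update those algorithms describe: an ordinary SGD step on the training loss, followed by correction steps that also descend λ times the L2 function-space distance to the pre-update function, estimated on a validation batch. The exact correction rule, the names, and the loop structure here are assumptions, not the authors' reference implementation:

```python
import torch

def hcgd_step(model, loss_fn, train_batch, val_inputs,
              lr=0.1, inner_lr=0.02, lam=0.5, n_corrections=1):
    """One Hilbert-constrained gradient descent update (sketch)."""
    params = [p for p in model.parameters() if p.requires_grad]
    inputs, targets = train_batch

    # Record the pre-update function on the validation batch.
    with torch.no_grad():
        f_old = model(val_inputs)

    # Proposed step: ordinary SGD on the training loss.
    loss = loss_fn(model(inputs), targets)
    grads = torch.autograd.grad(loss, params)
    with torch.no_grad():
        for p, g in zip(params, grads):
            p -= lr * g

    # Correction steps: descend loss + lambda * function-space
    # distance, keeping the new function close to f_old.
    for _ in range(n_corrections):
        loss = loss_fn(model(inputs), targets)
        dist = torch.sqrt(
            torch.mean((model(val_inputs) - f_old) ** 2) + 1e-12)
        grads = torch.autograd.grad(loss + lam * dist, params)
        with torch.no_grad():
            for p, g in zip(params, grads):
                p -= inner_lr * g
```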
Open Source Code: Yes
"A PyTorch implementation of the HCGD optimizer can be found at https://github.com/KordingLab/hilbert-constrained-gradient-descent."
Open Datasets: Yes
"train three random initializations on a 5000-image subset of CIFAR-10. We compared the performance of our approach at the benchmark task of permuted MNIST. We tested HCGD as applied to the CIFAR-10 image classification problem. We trained an LSTM on the sequential MNIST task."
Dataset Splits: No
"The paper mentions using a 'single large validation batch' and choosing 'the batch size for the validation batch to be 256', noting that 'while the examples in each validation batch were different than the training batch, they were also drawn from the train set.' However, it does not specify exact percentages or counts for training, validation, and test splits of the overall datasets (CIFAR-10 or MNIST), nor does it explicitly reference standard pre-defined splits for reproducibility."
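A minimal sketch of drawing such a validation batch, assuming (per the quoted text) that validation examples come from the training set but are shuffled independently of the training batch; the loader setup is illustrative:

```python
import torch
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# Training set; validation batches are drawn from the same set,
# shuffled independently so they differ from the training batch.
train_set = datasets.CIFAR10(root="./data", train=True, download=True,
                             transform=transforms.ToTensor())

train_loader = DataLoader(train_set, batch_size=128, shuffle=True)
val_loader = DataLoader(train_set, batch_size=256, shuffle=True)

val_inputs, _ = next(iter(val_loader))  # one 256-example validation batch
```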
Hardware Specification: No
"The paper does not provide any specific details about the hardware used for running the experiments, such as GPU models, CPU types, or memory specifications. It focuses on the software implementation and algorithms."
Software Dependencies: No
"The paper states 'All models were implemented in PyTorch (Paszke et al. (2017)).' While PyTorch is named, a specific version number is not provided, nor are any other software dependencies listed with their versions."
Experiment Setup: Yes
"The network is the same as in Figure 1: a CNN with four convolutional layers with batch normalization, followed by two fully-connected layers, trained with SGD with learning rate = 0.1, momentum = 0.9, and weight decay = 1e-4. We chose λ = 1.3 as the regularizing hyperparameter from a logarithmic grid search. Adam was used with a learning rate of 0.001. We used a tuned learning rate ϵ for SGD, and then used the same learning rate for HCGD. We use values of λ = 0.5 and η = 0.02, generally about 10 times less than the principal learning rate ϵ. We chose the batch size for the validation batch to be 256. The learning rate ϵ is decreased by a factor of 10 at epoch 150. For the train error we overlay the running average of each trace for clarity. We used 1 correction step, as before, but found that using more correction steps yielded even better performance. We trained an LSTM on the sequential MNIST task, in which pixels are input one at a time. Shown are the traces for SGD and Adam (both with learning rate 0.01)."
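A minimal sketch of the quoted CIFAR-10 setup: a four-convolutional-layer CNN with batch normalization and two fully-connected layers, trained with SGD at learning rate 0.1, momentum 0.9, and weight decay 1e-4, with the learning rate dropped tenfold at epoch 150. Layer widths and the scheduler choice are assumptions beyond what is quoted:

```python
import torch
import torch.nn as nn

# Sketch of the described CNN: four conv layers with batch norm,
# followed by two fully-connected layers (widths are assumptions).
model = nn.Sequential(
    nn.Conv2d(3, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
    nn.Conv2d(32, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(32, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
    nn.Conv2d(64, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(64 * 8 * 8, 256), nn.ReLU(),
    nn.Linear(256, 10),
)

# Quoted hyperparameters: lr = 0.1, momentum = 0.9, weight decay = 1e-4,
# with the learning rate decreased by a factor of 10 at epoch 150.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[150], gamma=0.1)
```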