Training Neural Networks for and by Interpolation

Authors: Leonard Berrada, Andrew Zisserman, M. Pawan Kumar

ICML 2020

Reproducibility Variable Result LLM Response
Research Type Experimental We empirically compare ALI-G to the optimization algorithms most commonly used in deep learning. Our experiments span a variety of architectures and tasks: (i) learning a differentiable neural computer; (ii) training wide residual networks on SVHN; (iii) training a Bi-LSTM on the Stanford Natural Language Inference data set; and (iv) training wide residual networks and densely connected networks on the CIFAR data sets.
Researcher Affiliation Collaboration Leonard Berrada 1 Andrew Zisserman 2 M. Pawan Kumar 2 1DeepMind, London, United Kingdom. Work performed while at University of Oxford. 2Department of Engineering Science, University of Oxford, Oxford, United Kingdom.
Pseudocode Yes Algorithm 1 The ALI-G algorithm
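The ALI-G update referenced above (Algorithm 1) computes a Polyak-style adaptive step size from the current loss and gradient, clipped at a maximal learning rate. A minimal sketch of one such update, assuming a smoothing constant `delta` and illustrative function/argument names not taken from the authors' code:

```python
import numpy as np

def alig_step(w, loss, grad, max_lr=0.1, delta=1e-5):
    """One ALI-G-style update (sketch): step size is loss / ||grad||^2,
    capped at a maximal learning rate max_lr. delta guards against
    division by zero; names here are illustrative."""
    gamma = min(loss / (np.dot(grad, grad) + delta), max_lr)
    return w - gamma * grad

# Example on f(w) = 0.5 * w^2 in 1D: loss = 0.5 * w^2, grad = w.
w = np.array([2.0])
w_new = alig_step(w, loss=2.0, grad=np.array([2.0]))
```

Because the interpolation assumption makes the minimal loss close to zero, the uncapped step size drives the loss toward zero without a hand-tuned schedule.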
Open Source Code Yes The code to reproduce our results is publicly available1. 1https://github.com/oval-group/ali-g
Open Datasets Yes training a wide residual network on the SVHN data set; (iii) training a Bi-LSTM on the SNLI data set; and (iv) training wide residual networks and densely connected networks on the CIFAR data sets. [...] We demonstrate the scalability of ALI-G by training a ResNet-18 (He et al., 2016) on the ImageNet data set.
Dataset Splits Yes From the 73k difficult training examples, we select 6k samples for validation; we use all remaining (both difficult and easy) examples for training, for a total of 598k samples. [...] We use 45k samples for training and 5k for validation.
Hardware Specification No All experiments are performed either on a 12-core CPU (differentiable neural computer), on a single GPU (SVHN, SNLI, CIFAR) or on up to 4 GPUs (ImageNet). No specific models (e.g., Intel Core i7, NVIDIA V100) or detailed specifications are provided.
Software Dependencies No In the TensorFlow (Abadi et al., 2015) experiment, we use the official and publicly available implementation of L4. In the PyTorch (Paszke et al., 2017) experiments, we use our implementation of L4, which we unit-test against the official TensorFlow implementation. No specific version numbers for TensorFlow, PyTorch, or any other software dependencies are provided.
Experiment Setup Yes We vary the initial learning rate as powers of ten between 10^-4 and 10^4 for each method except for L4Adam and L4Mom. [...] The gradient norm is clipped for all methods except for ALI-G, L4Adam and L4Mom. [...] The ℓ2 regularization is cross-validated in {0.0001, 0.0005} for all methods but ALI-G. For ALI-G, the regularization is expressed as a constraint on the ℓ2-norm of the parameters, and its maximal value is set to 50. SGD, ALI-G and BPGrad use a Nesterov momentum of 0.9. All methods use a dropout rate of 0.4 and a fixed budget of 160 epochs.
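The constraint-based regularization quoted above (an ℓ2-norm bound of 50 on the parameters, in place of a weight-decay term) is typically enforced by projecting onto the ℓ2-ball after each update. A minimal sketch, with an illustrative function name not taken from the authors' code:

```python
import numpy as np

def project_l2_ball(w, radius=50.0):
    """Project a parameter vector onto the l2-ball of the given radius
    (sketch). If the parameters already satisfy the constraint, they
    are returned unchanged; otherwise they are rescaled onto the ball.
    The radius of 50 matches the value quoted in the setup above."""
    norm = np.linalg.norm(w)
    if norm > radius:
        w = w * (radius / norm)
    return w

# Example: a vector of norm 60 is rescaled to norm 50.
w = project_l2_ball(np.array([60.0, 0.0]))
```

Expressing regularization as a hard constraint rather than a penalty keeps the loss at zero achievable within the feasible set, which is what the interpolation assumption behind ALI-G requires.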