Training Neural Networks for and by Interpolation
Authors: Leonard Berrada, Andrew Zisserman, M. Pawan Kumar
ICML 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically compare ALI-G to the optimization algorithms most commonly used in deep learning. Our experiments span a variety of architectures and tasks: (i) learning a differentiable neural computer; (ii) training wide residual networks on SVHN; (iii) training a Bi-LSTM on the Stanford Natural Language Inference data set; and (iv) training wide residual networks and densely connected networks on the CIFAR data sets. |
| Researcher Affiliation | Collaboration | Leonard Berrada¹, Andrew Zisserman², M. Pawan Kumar². ¹DeepMind, London, United Kingdom; work performed while at University of Oxford. ²Department of Engineering Science, University of Oxford, Oxford, United Kingdom. |
| Pseudocode | Yes | Algorithm 1: The ALI-G algorithm (a sketch of the update rule is given below the table) |
| Open Source Code | Yes | The code to reproduce our results is publicly available¹. ¹https://github.com/oval-group/ali-g |
| Open Datasets | Yes | training a wide residual network on the SVHN data set; (iii) training a Bi-LSTM on the SNLI data set; and (iv) training wide residual networks and densely connected networks on the CIFAR data sets. [...] We demonstrate the scalability of ALI-G by training a ResNet-18 (He et al., 2016) on the ImageNet data set. |
| Dataset Splits | Yes | From the 73k difficult training examples, we select 6k samples for validation; we use all remaining (both difficult and easy) examples for training, for a total of 598k samples. [...] We use 45k samples for training and 5k for validation. |
| Hardware Specification | No | All experiments are performed either on a 12-core CPU (differentiable neural computer), on a single GPU (SVHN, SNLI, CIFAR) or on up to 4 GPUs (ImageNet). No specific models (e.g., Intel Core i7, NVIDIA V100) or detailed specifications are provided. |
| Software Dependencies | No | In the TensorFlow (Abadi et al., 2015) experiment, we use the official and publicly available implementation of L4. In the PyTorch (Paszke et al., 2017) experiments, we use our implementation of L4, which we unit-test against the official TensorFlow implementation. No specific version numbers for TensorFlow, PyTorch, or any other software dependencies are provided. |
| Experiment Setup | Yes | We vary the initial learning rate as powers of ten between 10⁻⁴ and 10⁴ for each method except for L4Adam and L4Mom. [...] The gradient norm is clipped for all methods except for ALI-G, L4Adam and L4Mom. [...] The ℓ2 regularization is cross-validated in {0.0001, 0.0005} for all methods but ALI-G. For ALI-G, the regularization is expressed as a constraint on the ℓ2-norm of the parameters, and its maximal value is set to 50. SGD, ALI-G and BPGrad use a Nesterov momentum of 0.9. All methods use a dropout rate of 0.4 and a fixed budget of 160 epochs. (A configuration sketch based on these settings is given below the table.) |
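For context on the Pseudocode row: Algorithm 1 of the paper (ALI-G) computes its step size adaptively from the current loss value, clipped at a maximal learning rate. The snippet below is a minimal PyTorch sketch of that update, not the authors' released implementation (see the repository linked above); the names `alig_step`, `max_lr`, and `delta` are illustrative, and the momentum and ℓ2-ball constraint used in the experiments are omitted here.

```python
import torch


def alig_step(params, loss, max_lr=0.1, delta=1e-5):
    """One ALI-G-style update: gamma = min(max_lr, loss / (||grad||^2 + delta)).

    Sketch only: `params` is an iterable of tensors with requires_grad=True,
    and `loss` is the (not yet back-propagated) scalar loss at the current point.
    """
    params = list(params)
    grads = torch.autograd.grad(loss, params)            # d(loss)/d(params)
    grad_sqnorm = sum((g ** 2).sum() for g in grads)     # squared gradient norm
    step_size = torch.clamp(loss.detach() / (grad_sqnorm + delta), max=max_lr)
    with torch.no_grad():
        for p, g in zip(params, grads):
            p.sub_(step_size * g)                        # w <- w - gamma * grad
    return step_size.item()
```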
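The Experiment Setup row describes the search space used in the paper: the maximal learning rate is varied over powers of ten, Nesterov momentum is 0.9 (omitted from the sketch), and for ALI-G the ℓ2 regularization is replaced by a constraint of 50 on the ℓ2-norm of the parameters. The sketch below wires those values into a toy loop using the `alig_step` helper from the previous block; the model, dummy data, and projection helper are placeholders rather than the authors' code, and a real cross-validation run would re-initialize the model for each setting.

```python
import torch
import torch.nn as nn

# Placeholder model and data; the paper uses wide ResNets, DenseNets, a Bi-LSTM, etc.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))
criterion = nn.CrossEntropyLoss()

max_lr_grid = [10.0 ** k for k in range(-4, 5)]   # powers of ten from 1e-4 to 1e4
max_norm = 50.0                                   # ℓ2-ball radius used for ALI-G


def project_l2_ball(parameters, radius):
    """Rescale all parameters jointly onto an ℓ2 ball of the given radius (sketch)."""
    with torch.no_grad():
        norm = torch.sqrt(sum((p ** 2).sum() for p in parameters))
        if norm > radius:
            for p in parameters:
                p.mul_(radius / norm)


for max_lr in max_lr_grid:                        # cross-validate the maximal step size
    x = torch.randn(8, 3, 32, 32)                 # dummy CIFAR-sized batch
    y = torch.randint(0, 10, (8,))
    loss = criterion(model(x), y)
    alig_step(model.parameters(), loss, max_lr=max_lr)
    project_l2_ball(list(model.parameters()), max_norm)
```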