Deep Frank-Wolfe For Neural Network Optimization

Authors: Leonard Berrada, Andrew Zisserman, M. Pawan Kumar

Venue: ICLR 2019

Reproducibility assessment. Each entry below gives the variable, the assessed result, and the LLM response (supporting excerpt from the paper).
Research Type: Experimental. LLM Response: "We present experiments on the CIFAR and SNLI data sets, where we demonstrate the significant superiority of our method over Adam, Adagrad, as well as the recently proposed BPGrad and AMSGrad."
Researcher Affiliation: Academia. LLM Response: "Leonard Berrada¹, Andrew Zisserman¹ and M. Pawan Kumar¹,²; ¹Department of Engineering Science, University of Oxford; ²Alan Turing Institute; {lberrada,az,pawan}@robots.ox.ac.uk"
Pseudocode: Yes. LLM Response: "The main steps of DFW are shown in Algorithm 1."
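As context for the algorithm referenced above: the paper's central ingredient is a single Frank-Wolfe step on the dual of a proximal problem, with an optimal step size available in closed form. The sketch below illustrates that idea generically, as a Frank-Wolfe step with exact line search on a quadratic over the simplex. It is not a reproduction of the paper's Algorithm 1, and all names are chosen for the example.

```python
import numpy as np

def frank_wolfe_step(alpha, Q, b):
    """One Frank-Wolfe step with exact line search for
    f(alpha) = 0.5 * alpha^T Q alpha - b^T alpha over the probability simplex.
    Illustrative only; DFW applies this idea to the dual of a proximal problem."""
    grad = Q @ alpha - b                      # gradient of the quadratic objective
    s = np.zeros_like(alpha)
    s[np.argmin(grad)] = 1.0                  # linear minimization oracle over the simplex
    d = s - alpha                             # Frank-Wolfe direction
    curvature = d @ Q @ d
    if curvature <= 0.0:
        gamma = 1.0                           # degenerate direction: take the full step
    else:
        # closed-form optimal step size, clipped to [0, 1]
        gamma = float(np.clip(-(grad @ d) / curvature, 0.0, 1.0))
    return alpha + gamma * d, gamma

# Toy usage: a random 3-dimensional positive semi-definite quadratic.
rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3))
Q, b = A @ A.T, rng.standard_normal(3)
alpha = np.full(3, 1.0 / 3.0)                 # start at the simplex barycenter
alpha, gamma = frank_wolfe_step(alpha, Q, b)
```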
Open Source Code: Yes. LLM Response: "The code is publicly available at https://github.com/oval-group/dfw."
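Since the repository is public, a training script can in principle reuse the released optimizer directly. The snippet below is a hypothetical usage sketch: the import paths, constructor arguments, and the convention of passing the loss value to step() are assumptions and should be verified against the README at https://github.com/oval-group/dfw.

```python
import torch
import torch.nn as nn
from dfw import DFW                          # assumed import path for the released optimizer
from dfw.losses import MultiClassHingeLoss   # assumed: DFW is designed for an SVM-style loss

model = nn.Linear(784, 10)                   # stand-in model for illustration only
criterion = MultiClassHingeLoss()
# eta plays the role of an initial learning rate; momentum as in the paper's setup
optimizer = DFW(model.parameters(), eta=0.1, momentum=0.9)

x, y = torch.randn(32, 784), torch.randint(0, 10, (32,))
loss = criterion(model(x), y)
optimizer.zero_grad()
loss.backward()
optimizer.step(lambda: float(loss))          # assumed: DFW uses the loss value for its closed-form step size
```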
Open Datasets: Yes. LLM Response: "We present experiments on the CIFAR and SNLI data sets... The CIFAR-10/100 data sets contain 60,000 RGB natural images of size 32×32 with 10/100 classes (Krizhevsky, 2009). The Stanford Natural Language Inference (SNLI) data set is a large corpus of 570k pairs of sentences (Bowman et al., 2015)."
Dataset Splits: Yes. LLM Response: "We split the training set into 45,000 training samples and 5,000 validation samples, and use 10,000 samples for testing."
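A minimal sketch of how the quoted CIFAR-10 split could be reproduced with torchvision, assuming a random 45,000/5,000 partition of the official training set (the paper does not specify how the validation indices are chosen) and the standard 10,000-image test set; the normalization statistics are the usual per-channel CIFAR values, not numbers quoted from the paper.

```python
import torch
from torch.utils.data import Subset, DataLoader
from torchvision import datasets, transforms

transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2470, 0.2435, 0.2616)),
])

full_train = datasets.CIFAR10(root="./data", train=True, download=True, transform=transform)
test_set = datasets.CIFAR10(root="./data", train=False, download=True, transform=transform)

# 45,000 / 5,000 train/validation split; a seeded random permutation is an assumption.
generator = torch.Generator().manual_seed(0)
perm = torch.randperm(len(full_train), generator=generator).tolist()
train_set = Subset(full_train, perm[:45000])
val_set = Subset(full_train, perm[45000:])

train_loader = DataLoader(train_set, batch_size=128, shuffle=True)  # WRN batch size from the paper
val_loader = DataLoader(val_set, batch_size=128)
test_loader = DataLoader(test_set, batch_size=128)                  # 10,000 test samples
```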
Hardware Specification: Yes. LLM Response: "All models are trained on a single Nvidia Titan Xp card."
Software Dependencies: No. LLM Response: "Our experiments are implemented in pytorch (Paszke et al., 2017), and the code is available at https://github.com/oval-group/dfw." While PyTorch is mentioned, a specific version number is not provided.
Experiment Setup: Yes. LLM Response: "For all methods, we set a budget of 200 epochs for WRN and 300 epochs for DN. Furthermore, the batch-size is respectively set to 128 and 64 for WRN and DN as in (Zagoruyko & Komodakis, 2016) and (Huang et al., 2017). For DN, the ℓ2 regularization is set to 10⁻⁴ as in (Huang et al., 2017). For WRN, the ℓ2 is cross-validated between 5·10⁻⁴, as in (Zagoruyko & Komodakis, 2016), and 10⁻⁴, a more usual value that we have found to perform better for some of the methods (in particular DFW, since the corresponding loss function is an SVM instead of CE, for which the value of 5·10⁻⁴ was designed). The value of the Nesterov momentum is set to 0.9 for BPGrad, SGD and DFW. DFW has only one hyper-parameter to tune, namely η, which is analogous to an initial learning rate. For SGD, the initial learning rate is set to 0.1 on both WRN and DN."
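As a reading aid, the sketch below maps the quoted hyper-parameters for the SGD baseline onto a PyTorch optimizer; it is not the authors' training script, and the model is a placeholder for the WRN or DN being trained.

```python
import torch

model = torch.nn.Linear(3 * 32 * 32, 10)   # placeholder, not the actual WRN/DN architecture

optimizer = torch.optim.SGD(
    model.parameters(),
    lr=0.1,             # initial learning rate for SGD on both WRN and DN
    momentum=0.9,       # Nesterov momentum value used for SGD, DFW and BPGrad
    nesterov=True,
    weight_decay=5e-4,  # WRN value, cross-validated against 1e-4; DN uses 1e-4
)

# Training budget from the quoted setup: 200 epochs at batch size 128 for WRN,
# or 300 epochs at batch size 64 for DN.
num_epochs, batch_size = 200, 128
```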