Deep Frank-Wolfe For Neural Network Optimization
Authors: Leonard Berrada, Andrew Zisserman, M. Pawan Kumar
ICLR 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We present experiments on the CIFAR and SNLI data sets, where we demonstrate the significant superiority of our method over Adam, Adagrad, as well as the recently proposed BPGrad and AMSGrad. |
| Researcher Affiliation | Academia | Leonard Berrada1, Andrew Zisserman1 and M. Pawan Kumar1,2 1Department of Engineering Science University of Oxford 2Alan Turing Institute {lberrada,az,pawan}@robots.ox.ac.uk |
| Pseudocode | Yes | The main steps of DFW are shown in Algorithm 1. |
| Open Source Code | Yes | The code is publicly available at https://github.com/oval-group/dfw. |
| Open Datasets | Yes | We present experiments on the CIFAR and SNLI data sets... The CIFAR-10/100 data sets contain 60,000 RGB natural images of size 32×32 with 10/100 classes (Krizhevsky, 2009). The Stanford Natural Language Inference (SNLI) data set is a large corpus of 570k pairs of sentences (Bowman et al., 2015). |
| Dataset Splits | Yes | We split the training set into 45,000 training samples and 5,000 validation samples, and use 10,000 samples for testing. (A data-loading and split sketch follows the table.) |
| Hardware Specification | Yes | All models are trained on a single Nvidia Titan Xp card. |
| Software Dependencies | No | Our experiments are implemented in pytorch (Paszke et al., 2017), and the code is available at https://github.com/oval-group/dfw. While PyTorch is mentioned, a specific version number is not provided. |
| Experiment Setup | Yes | For all methods, we set a budget of 200 epochs for WRN and 300 epochs for DN. Furthermore, the batch-size is respectively set to 128 and 64 for WRN and DN as in (Zagoruyko & Komodakis, 2016) and (Huang et al., 2017). For DN, the l2 regularization is set to 10⁻⁴ as in (Huang et al., 2017). For WRN, the l2 is cross-validated between 5×10⁻⁴, as in (Zagoruyko & Komodakis, 2016), and 10⁻⁴, a more usual value that we have found to perform better for some of the methods (in particular DFW, since the corresponding loss function is an SVM instead of CE, for which the value of 5×10⁻⁴ was designed). The value of the Nesterov momentum is set to 0.9 for BPGrad, SGD and DFW. DFW has only one hyper-parameter to tune, namely η, which is analogous to an initial learning rate. For SGD, the initial learning rate is set to 0.1 on both WRN and DN. (These hyperparameters are collected in the training-setup sketch below.) |
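
The CIFAR-10 split quoted above (45,000 training, 5,000 validation, 10,000 test samples) is a standard hold-out from the official training set. Below is a minimal sketch, assuming torchvision's CIFAR-10 loader and commonly used normalization statistics; the authors' exact preprocessing is not specified in the quoted excerpts.

```python
# Sketch of the reported CIFAR-10 data split (45k train / 5k val / 10k test).
# The normalization constants are common CIFAR-10 channel statistics, assumed here,
# not taken from the paper.
import torch
from torch.utils.data import DataLoader, random_split
from torchvision import datasets, transforms

transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2470, 0.2435, 0.2616)),
])

full_train = datasets.CIFAR10(root="./data", train=True, download=True, transform=transform)
test_set = datasets.CIFAR10(root="./data", train=False, download=True, transform=transform)

# 45,000 training samples and 5,000 validation samples, as reported in the paper.
train_set, val_set = random_split(
    full_train, [45000, 5000], generator=torch.Generator().manual_seed(0)
)

train_loader = DataLoader(train_set, batch_size=128, shuffle=True)   # WRN batch size
val_loader = DataLoader(val_set, batch_size=128, shuffle=False)
test_loader = DataLoader(test_set, batch_size=128, shuffle=False)    # 10,000 test samples
```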
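
The reported baseline hyperparameters for WRN on CIFAR (batch size 128, SGD with Nesterov momentum 0.9, initial learning rate 0.1, l2 regularization cross-validated between 5×10⁻⁴ and 10⁻⁴, 200-epoch budget) can be assembled as in the sketch below. This is not the authors' implementation: the `nn.Sequential` model is a placeholder standing in for WRN-28-10, and `train_loader` comes from the split sketch above. DFW itself is released at the linked repository and, per the paper, exposes a single hyperparameter η analogous to this initial learning rate.

```python
# Minimal sketch of the reported SGD baseline configuration for WRN on CIFAR.
import torch
import torch.nn as nn

# Placeholder model standing in for WRN-28-10 (Zagoruyko & Komodakis, 2016);
# the actual architecture is not reproduced here.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))
criterion = nn.CrossEntropyLoss()

optimizer = torch.optim.SGD(
    model.parameters(),
    lr=0.1,              # initial learning rate reported for SGD
    momentum=0.9,        # Nesterov momentum shared by SGD, BPGrad and DFW
    nesterov=True,
    weight_decay=5e-4,   # l2 cross-validated between 5e-4 and 1e-4 in the paper
)

for epoch in range(200):                   # 200-epoch budget for WRN (300 for DN)
    for images, labels in train_loader:    # `train_loader` from the split sketch above
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
```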