Harnessing the Power of Infinitely Wide Deep Nets on Small-data Tasks

Authors: Sanjeev Arora, Simon S. Du, Zhiyuan Li, Ruslan Salakhutdinov, Ruosong Wang, Dingli Yu

ICLR 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We report results suggesting neural tangent kernels perform strongly on low-data tasks. 1. On a standard testbed of classification/regression tasks from the UCI database, NTK SVM beats the previous gold standard, Random Forests (RF), and also the corresponding finite nets. 2. On CIFAR-10 with 10 to 640 training samples, Convolutional NTK consistently beats ResNet-34 by 1% to 3%. 3. On the VOC07 testbed for few-shot image classification tasks on ImageNet with transfer learning (Goyal et al., 2019), replacing the linear SVM currently used with a Convolutional NTK SVM consistently improves performance. 4. Comparing the performance of NTK with the finite-width net it was derived from, NTK behavior starts at lower net widths than suggested by theoretical analysis (Arora et al., 2019a).
Researcher Affiliation | Academia | Sanjeev Arora, Princeton University, arora@cs.princeton.edu; Simon S. Du, Institute for Advanced Study, ssdu@ias.edu; Zhiyuan Li, Princeton University, zhiyuanli@cs.princeton.edu; Ruslan Salakhutdinov, Carnegie Mellon University, rsalakhu@cs.cmu.edu; Ruosong Wang, Carnegie Mellon University, ruosongw@andrew.cmu.edu; Dingli Yu, Princeton University, dingliy@cs.princeton.edu
Pseudocode | No | No structured pseudocode or algorithm blocks found.
Open Source Code | No | The authors plan to release the code, to allow off-the-shelf use of this method. It does not require GPUs.
Open Datasets | Yes | On a standard testbed of classification/regression tasks from the UCI database... On CIFAR-10 with 10 to 640 training samples... On the VOC07 testbed for few-shot image classification tasks on ImageNet... Features are extracted from layers conv1, conv2, conv3, conv4, conv5 in ResNet-50 (He et al., 2016) trained on ImageNet (Deng et al., 2009)... See Table 6 in Appendix A for a summary of the datasets we used. The detailed experiment setup, including the choice of datasets, training/test splitting, and ranges of hyperparameters, is described in Appendix A. We note that usual methods of obtaining confidence bounds in these low-data settings are somewhat heuristic. ...pre-processed datasets by Fernández-Delgado et al. (2014) from the link: http://persoal.citius.usc.es/manuel.fernandez.delgado/papers/jmlr/data.tar.gz (a download sketch follows the table)
Dataset Splits | Yes | We follow the comparison setup of Fernández-Delgado et al. (2014) and report 4-fold cross-validation accuracy. For hyperparameters, we tune them with the same validation methodology as Fernández-Delgado et al. (2014): all available training samples are randomly split into one training and one test set, while imposing that each class has the same number of training and test samples. The parameters with the best validation accuracy are then selected. It is possible to give confidence bounds for this parameter-tuning scheme, but they are worse than standard ones for separate training/validation/test data. (A sketch of this protocol follows the table.)
Hardware Specification | No | We thank Amazon Web Services for providing compute time for the experiments in this paper, and NVIDIA for GPU support.
Software Dependencies | No | We use sklearn.svm.LinearSVC to train linear SVMs, and sklearn.svm.SVC to train kernel SVMs (for CNTK). (A usage sketch with a precomputed kernel follows the table.)
Experiment Setup | Yes | The detailed experiment setup, including the choice of datasets, training/test splitting, and ranges of hyperparameters, is described in Appendix A. ...NTK Specification: We calculate the NTK induced by fully-connected neural networks with L layers where the bottom L' layers are fixed, and then use C-support vector classification implemented in sklearn.svm. We tune the hyperparameters L from 1 to 5, L' from 0 to L−1, and the cost value C as powers of ten from −2 to 4. ...NN Specification: We use a fully-connected NN with L layers and 512 hidden nodes per layer, trained with gradient descent. We tune the hyperparameters L from 1 to 5, with/without batch normalization, and learning rate 0.1 or 1. We run gradient descent for 2000 epochs. ...We use ResNet-34 with width 64, 128, or 256 and default hyperparameters: learning rate 0.1, momentum 0.9, weight decay 0.0005. We decay the learning rate by a factor of 10 at epochs 80 and 120, with 160 training epochs in total. The training batch size is the minimum of the size of the whole training dataset and 160. (Sketches of the NTK kernel computation and the ResNet-34 schedule follow this table.)
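
For the Open Datasets row, a minimal sketch of fetching and unpacking the pre-processed UCI bundle from the URL quoted there. Only the URL comes from the excerpt; the local file and directory names are arbitrary choices made for this sketch.

    import tarfile
    import urllib.request
    from pathlib import Path

    # URL quoted in the Open Datasets row above.
    URL = ("http://persoal.citius.usc.es/manuel.fernandez.delgado/"
           "papers/jmlr/data.tar.gz")

    archive = Path("uci_data.tar.gz")            # local file name: arbitrary choice
    if not archive.exists():
        urllib.request.urlretrieve(URL, archive)

    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(path="uci_data")          # unpack next to the script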
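For the Dataset Splits row, a rough sketch of the quoted protocol: hyperparameters are chosen on a single class-balanced random split, then 4-fold cross-validation accuracy is reported for the selected setting. The estimator, the grid, and the demo dataset are placeholders; the paper follows Fernández-Delgado et al. (2014) exactly, which this only approximates.

    import numpy as np
    from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
    from sklearn.svm import SVC

    def tune_and_evaluate(X, y, param_grid, seed=0):
        # Step 1: choose hyperparameters on one class-balanced random split
        # (stratify=y approximates "each class has the same number of
        # training and test samples").
        X_tr, X_val, y_tr, y_val = train_test_split(
            X, y, test_size=0.5, stratify=y, random_state=seed)
        best_params, best_acc = None, -np.inf
        for params in param_grid:
            acc = SVC(**params).fit(X_tr, y_tr).score(X_val, y_val)
            if acc > best_acc:
                best_params, best_acc = params, acc
        # Step 2: report 4-fold cross-validation accuracy for that setting.
        folds = StratifiedKFold(n_splits=4, shuffle=True, random_state=seed)
        scores = cross_val_score(SVC(**best_params), X, y, cv=folds)
        return best_params, scores.mean()

    if __name__ == "__main__":
        from sklearn.datasets import load_iris              # small stand-in dataset
        X, y = load_iris(return_X_y=True)
        grid = [{"C": 10.0 ** e} for e in range(-2, 5)]      # cost values as powers of ten
        print(tune_and_evaluate(X, y, grid))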
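For the Software Dependencies and NTK Specification rows, a minimal sketch (not the authors' released code) of computing the NTK Gram matrix of a fully-connected ReLU network with the arc-cosine recursion of Arora et al. (2019a) and training a kernel SVM on it via sklearn.svm.SVC with kernel="precomputed". The fixed-bottom-layer variant (L') mentioned in the row is omitted, the depth and cost values are placeholders, and the toy data is unit-normalized only so the Gram diagonal equals 1.

    import numpy as np
    from sklearn.svm import SVC

    def ntk_gram(X1, X2, depth):
        """NTK Gram matrix between rows of X1 (n1 x d) and X2 (n2 x d) for a
        fully-connected ReLU network with `depth` nonlinear layers."""
        sigma = X1 @ X2.T                                   # Sigma^(0): input inner products
        norms = np.sqrt(np.outer(np.sum(X1 * X1, axis=1),   # sqrt(Sigma(x,x) * Sigma(x',x'))
                                 np.sum(X2 * X2, axis=1)))
        theta = sigma.copy()                                # running NTK
        for _ in range(depth):
            lam = np.clip(sigma / norms, -1.0, 1.0)
            # ReLU arc-cosine closed forms; with this normalization the diagonal
            # entries are preserved, so `norms` can be reused at every layer.
            sigma = norms * (lam * (np.pi - np.arccos(lam))
                             + np.sqrt(1.0 - lam ** 2)) / np.pi
            sigma_dot = (np.pi - np.arccos(lam)) / np.pi
            theta = theta * sigma_dot + sigma
        return theta

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        X_train = rng.standard_normal((100, 20))
        X_test = rng.standard_normal((40, 20))
        X_train /= np.linalg.norm(X_train, axis=1, keepdims=True)   # unit-normalize rows
        X_test /= np.linalg.norm(X_test, axis=1, keepdims=True)
        y_train = rng.integers(0, 2, size=100)
        y_test = rng.integers(0, 2, size=40)

        depth, cost = 3, 1.0                 # placeholder values, not the tuned grid
        clf = SVC(kernel="precomputed", C=cost)
        clf.fit(ntk_gram(X_train, X_train, depth), y_train)
        print("toy accuracy:", clf.score(ntk_gram(X_test, X_train, depth), y_test))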
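For the finite-net baseline in the Experiment Setup row, a PyTorch sketch of the quoted ResNet-34 schedule: SGD with learning rate 0.1, momentum 0.9, weight decay 0.0005, a 10x decay at epochs 80 and 120, and 160 epochs in total. The width-varied ResNet-34 is replaced by torchvision's stock resnet34 and the CIFAR-10 subset by random tensors, so only the optimizer and schedule follow the excerpt.

    import torch
    from torch import nn, optim
    from torch.utils.data import DataLoader, TensorDataset
    from torchvision.models import resnet34

    # Stand-ins for the paper's width-varied ResNet-34 and its CIFAR-10 subsets.
    model = resnet34(num_classes=10)
    images, labels = torch.randn(64, 3, 32, 32), torch.randint(0, 10, (64,))
    # Batch size = min(|training set|, 160) per the excerpt; 64 samples here, so 64.
    loader = DataLoader(TensorDataset(images, labels), batch_size=64, shuffle=True)

    criterion = nn.CrossEntropyLoss()
    optimizer = optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=5e-4)
    # Decay the learning rate by a factor of 10 at epochs 80 and 120.
    scheduler = optim.lr_scheduler.MultiStepLR(optimizer, milestones=[80, 120], gamma=0.1)

    for epoch in range(160):
        for inputs, targets in loader:
            optimizer.zero_grad()
            loss = criterion(model(inputs), targets)
            loss.backward()
            optimizer.step()
        scheduler.step()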