Harnessing the Power of Infinitely Wide Deep Nets on Small-data Tasks

Authors: Sanjeev Arora, Simon S. Du, Zhiyuan Li, Ruslan Salakhutdinov, Ruosong Wang, Dingli Yu

ICLR 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We report results suggesting neural tangent kernels perform strongly on low-data tasks. 1. On a standard testbed of classification/regression tasks from the UCI database, NTK SVM beats the previous gold standard, Random Forests (RF), and also the corresponding finite nets. 2. On CIFAR-10 with 10 to 640 training samples, Convolutional NTK consistently beats ResNet-34 by 1% to 3%. 3. On the VOC07 testbed for few-shot image classification tasks on ImageNet with transfer learning (Goyal et al., 2019), replacing the linear SVM currently used with a Convolutional NTK SVM consistently improves performance. 4. Comparing the performance of NTK with the finite-width net it was derived from, NTK behavior starts at lower net widths than suggested by theoretical analysis (Arora et al., 2019a).
Researcher Affiliation | Academia | Sanjeev Arora, Princeton University, arora@cs.princeton.edu; Simon S. Du, Institute for Advanced Study, ssdu@ias.edu; Zhiyuan Li, Princeton University, zhiyuanli@cs.princeton.edu; Ruslan Salakhutdinov, Carnegie Mellon University, rsalakhu@cs.cmu.edu; Ruosong Wang, Carnegie Mellon University, ruosongw@andrew.cmu.edu; Dingli Yu, Princeton University, dingliy@cs.princeton.edu
Pseudocode | No | No structured pseudocode or algorithm blocks found.
Open Source Code | No | The authors plan to release the code, to allow off-the-shelf use of this method. It does not require GPUs.
Open Datasets | Yes | On a standard testbed of classification/regression tasks from the UCI database... On CIFAR-10 with 10 to 640 training samples... On the VOC07 testbed for few-shot image classification tasks on ImageNet... Features are extracted from layers conv1, conv2, conv3, conv4, conv5 in ResNet-50 (He et al., 2016) trained on ImageNet (Deng et al., 2009)... See Table 6 in Appendix A for a summary of the datasets we used. The detailed experiment setup, including the choice of datasets, training/test splitting, and ranges of hyperparameters, is described in Appendix A. We note that usual methods of obtaining confidence bounds in these low-data settings are somewhat heuristic. ...pre-processed datasets by Fernández-Delgado et al. (2014) from the link: http://persoal.citius.usc.es/manuel.fernandez.delgado/papers/jmlr/data.tar.gz (a download sketch follows the table)
Dataset Splits | Yes | We follow the comparison setup of Fernández-Delgado et al. (2014) and report 4-fold cross-validation accuracy. For hyperparameters, we tune them with the same validation methodology as Fernández-Delgado et al. (2014): all available training samples are randomly split into one training and one test set, while imposing that each class has the same number of training and test samples. The parameters with the best validation accuracy are then selected. It is possible to give confidence bounds for this parameter-tuning scheme, but they are worse than standard ones for separate training/validation/test data. (A sketch of this protocol follows the table.)
Hardware Specification | No | We thank Amazon Web Services for providing compute time for the experiments in this paper, and NVIDIA for GPU support.
Software Dependencies | No | We use sklearn.svm.LinearSVC to train linear SVMs, and sklearn.svm.SVC to train kernel SVMs (for CNTK). (A usage sketch with a precomputed kernel follows the table.)
Experiment Setup | Yes | The detailed experiment setup, including the choice of datasets, training/test splitting, and ranges of hyperparameters, is described in Appendix A. ...NTK Specification: We calculate the NTK induced by fully-connected neural networks with L layers where the bottom L' layers are fixed, and then use C-support vector classification implemented in sklearn.svm. We tune the hyperparameters L from 1 to 5, L' from 0 to L−1, and the cost value C as powers of ten from −2 to 4. ...NN Specification: We use a fully-connected NN with L layers and 512 hidden nodes per layer, trained with gradient descent. We tune the hyperparameters L from 1 to 5, with/without batch normalization, and learning rate 0.1 or 1. We run gradient descent for 2000 epochs. ...We use ResNet-34 with width 64, 128, or 256 and default hyperparameters: learning rate 0.1, momentum 0.9, weight decay 0.0005. We decay the learning rate by a factor of 10 at epochs 80 and 120, with 160 training epochs in total. The training batch size is the minimum of the size of the whole training dataset and 160. (Sketches of the NTK kernel computation and the ResNet-34 schedule follow this table.)
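
For the Open Datasets row, a minimal sketch of fetching and unpacking the pre-processed UCI bundle from the URL quoted there. Only the URL comes from the excerpt; the local file and directory names are arbitrary choices made for this sketch.

    import tarfile
    import urllib.request
    from pathlib import Path

    # URL quoted in the Open Datasets row above.
    URL = ("http://persoal.citius.usc.es/manuel.fernandez.delgado/"
           "papers/jmlr/data.tar.gz")

    archive = Path("uci_data.tar.gz")            # local file name: arbitrary choice
    if not archive.exists():
        urllib.request.urlretrieve(URL, archive)

    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(path="uci_data")          # unpack next to the script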
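For the Dataset Splits row, a rough sketch of the quoted protocol: hyperparameters are chosen on a single class-balanced random split, then 4-fold cross-validation accuracy is reported for the selected setting. The estimator, the grid, and the demo dataset are placeholders; the paper follows Fernández-Delgado et al. (2014) exactly, which this only approximates.

    import numpy as np
    from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
    from sklearn.svm import SVC

    def tune_and_evaluate(X, y, param_grid, seed=0):
        # Step 1: choose hyperparameters on one class-balanced random split
        # (stratify=y approximates "each class has the same number of
        # training and test samples").
        X_tr, X_val, y_tr, y_val = train_test_split(
            X, y, test_size=0.5, stratify=y, random_state=seed)
        best_params, best_acc = None, -np.inf
        for params in param_grid:
            acc = SVC(**params).fit(X_tr, y_tr).score(X_val, y_val)
            if acc > best_acc:
                best_params, best_acc = params, acc
        # Step 2: report 4-fold cross-validation accuracy for that setting.
        folds = StratifiedKFold(n_splits=4, shuffle=True, random_state=seed)
        scores = cross_val_score(SVC(**best_params), X, y, cv=folds)
        return best_params, scores.mean()

    if __name__ == "__main__":
        from sklearn.datasets import load_iris              # small stand-in dataset
        X, y = load_iris(return_X_y=True)
        grid = [{"C": 10.0 ** e} for e in range(-2, 5)]      # cost values as powers of ten
        print(tune_and_evaluate(X, y, grid))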
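For the Software Dependencies and NTK Specification rows, a minimal sketch (not the authors' released code) of computing the NTK Gram matrix of a fully-connected ReLU network with the arc-cosine recursion of Arora et al. (2019a) and training a kernel SVM on it via sklearn.svm.SVC with kernel="precomputed". The fixed-bottom-layer variant (L') mentioned in the row is omitted, the depth and cost values are placeholders, and the toy data is unit-normalized only so the Gram diagonal equals 1.

    import numpy as np
    from sklearn.svm import SVC

    def ntk_gram(X1, X2, depth):
        """NTK Gram matrix between rows of X1 (n1 x d) and X2 (n2 x d) for a
        fully-connected ReLU network with `depth` nonlinear layers."""
        sigma = X1 @ X2.T                                   # Sigma^(0): input inner products
        norms = np.sqrt(np.outer(np.sum(X1 * X1, axis=1),   # sqrt(Sigma(x,x) * Sigma(x',x'))
                                 np.sum(X2 * X2, axis=1)))
        theta = sigma.copy()                                # running NTK
        for _ in range(depth):
            lam = np.clip(sigma / norms, -1.0, 1.0)
            # ReLU arc-cosine closed forms; with this normalization the diagonal
            # entries are preserved, so `norms` can be reused at every layer.
            sigma = norms * (lam * (np.pi - np.arccos(lam))
                             + np.sqrt(1.0 - lam ** 2)) / np.pi
            sigma_dot = (np.pi - np.arccos(lam)) / np.pi
            theta = theta * sigma_dot + sigma
        return theta

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        X_train = rng.standard_normal((100, 20))
        X_test = rng.standard_normal((40, 20))
        X_train /= np.linalg.norm(X_train, axis=1, keepdims=True)   # unit-normalize rows
        X_test /= np.linalg.norm(X_test, axis=1, keepdims=True)
        y_train = rng.integers(0, 2, size=100)
        y_test = rng.integers(0, 2, size=40)

        depth, cost = 3, 1.0                 # placeholder values, not the tuned grid
        clf = SVC(kernel="precomputed", C=cost)
        clf.fit(ntk_gram(X_train, X_train, depth), y_train)
        print("toy accuracy:", clf.score(ntk_gram(X_test, X_train, depth), y_test))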
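For the finite-net baseline in the Experiment Setup row, a PyTorch sketch of the quoted ResNet-34 schedule: SGD with learning rate 0.1, momentum 0.9, weight decay 0.0005, a 10x decay at epochs 80 and 120, and 160 epochs in total. The width-varied ResNet-34 is replaced by torchvision's stock resnet34 and the CIFAR-10 subset by random tensors, so only the optimizer and schedule follow the excerpt.

    import torch
    from torch import nn, optim
    from torch.utils.data import DataLoader, TensorDataset
    from torchvision.models import resnet34

    # Stand-ins for the paper's width-varied ResNet-34 and its CIFAR-10 subsets.
    model = resnet34(num_classes=10)
    images, labels = torch.randn(64, 3, 32, 32), torch.randint(0, 10, (64,))
    # Batch size = min(|training set|, 160) per the excerpt; 64 samples here, so 64.
    loader = DataLoader(TensorDataset(images, labels), batch_size=64, shuffle=True)

    criterion = nn.CrossEntropyLoss()
    optimizer = optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=5e-4)
    # Decay the learning rate by a factor of 10 at epochs 80 and 120.
    scheduler = optim.lr_scheduler.MultiStepLR(optimizer, milestones=[80, 120], gamma=0.1)

    for epoch in range(160):
        for inputs, targets in loader:
            optimizer.zero_grad()
            loss = criterion(model(inputs), targets)
            loss.backward()
            optimizer.step()
        scheduler.step()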