Improving Neural Network Training in Low Dimensional Random Bases
Authors: Frithjof Gressmann, Zach Eaton-Rosen, Carlo Luschi
NeurIPS 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Here, we revisit optimization in low-dimensional random subspaces with the aim of improving its practical optimization performance. We show that while random subspace projections have computational benefits such as easy distribution on many workers, they become less efficient with growing projection dimensionality, or if the subspace projection is fixed throughout training. We observe that applying smaller independent random projections to different parts of the network and re-drawing them at every step significantly improves the obtained accuracy on fully-connected and several convolutional architectures, including ResNets on the MNIST, Fashion-MNIST and CIFAR-10 datasets. Table 1 reports the validation accuracy after 100 epochs (plots in Supplementary Material, Figure B.6). All methods other than SGD use a dimensionality reduction factor of 400. |
| Researcher Affiliation | Industry | Frithjof Gressmann, Graphcore Research, Bristol, UK, frithjof@graphcore.ai; Zach Eaton-Rosen, Graphcore Research, London, UK, zacher@graphcore.ai; Carlo Luschi, Graphcore Research, Bristol, UK, carlo@graphcore.ai |
| Pseudocode | Yes | Algorithm 1: Training procedures for a single worker (left) and for parallelized workers (right). |
| Open Source Code | Yes | Our source code is available at https://github.com/graphcore-research/random-bases |
| Open Datasets | Yes | fully-connected and several convolutional architectures, including ResNets on the MNIST, Fashion-MNIST and CIFAR-10 datasets. |
| Dataset Splits | No | The paper mentions 'validation accuracy' for standard datasets (MNIST, FMNIST, CIFAR-10) but does not explicitly state the training, validation, or test split percentages or sample counts for these datasets, nor does it provide a citation for specific predefined splits used. |
| Hardware Specification | Yes | To meet the algorithmic demand for fast pseudo-random number generation (PRNG), we conduct these experiments using Graphcore's first generation Intelligence Processing Unit (IPU). The Colossus MK1 IPU (GC2) accelerator is composed of 1216 independent cores with in-core PRNG hardware units that can generate up to 944 billion random samples per second [22]. On a single IPU, random bases descent training of the CIFAR-10 CNN with the extremely sample intensive dimension d = 10k achieved a throughput of 31 images per second (100 epochs / 1.88 days), whereas training the same model on an 80-core CPU machine achieved 2.6 images/second (100 epochs / 22.5 days). To rule out the possibility that the measured speedup can be attributed to the forward-backward acceleration only, we also measured the throughput of our implementation on a V100 GPU accelerator but found no significant throughput improvement relative to the CPU baseline. |
| Software Dependencies | No | The paper mentions 'TensorFlow implementation' but does not specify a version number for TensorFlow or any other software libraries or dependencies used. |
| Experiment Setup | Yes | All networks use ReLU nonlinearities and are trained with a softmax cross-entropy loss on the image classification tasks MNIST, Fashion-MNIST (FMNIST), and CIFAR-10. Unless otherwise noted, basis vectors are drawn from a normal distribution and normalized. We do not explicitly orthogonalize, but rely on the quasi-orthogonality of random directions in high dimensions [13]. Further details can be found in the Supplementary Material. ... Table 1 reports the validation accuracy after 100 epochs ... We train the CNN on CIFAR-10 for 2000 epochs (2.5 million steps)... Input: Learning rate η_RBD, network initialization θ_{t=0} |
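The core update quoted above (normalized random basis vectors, re-drawn at every step, no explicit orthogonalization) can be sketched in a few lines of NumPy. This is a minimal illustration on a toy quadratic objective, not the paper's implementation: the objective, dimensions, learning rate, and step count are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def loss(theta):
    # Toy quadratic objective standing in for a network's training loss.
    return 0.5 * np.sum(theta ** 2)

def loss_grad(theta):
    # Gradient of the toy objective in the full parameter space.
    return theta

D = 1000    # full parameter count (illustrative)
d = 10      # random subspace dimension, d << D
eta = 1.0   # learning rate (η_RBD in the paper's notation; value illustrative)
theta = rng.normal(size=D)
init_loss = loss(theta)

for step in range(500):
    # Re-draw a fresh random basis at every step, as the paper finds beneficial.
    B = rng.normal(size=(d, D))
    # Normalize each direction; rely on quasi-orthogonality in high dimensions.
    B /= np.linalg.norm(B, axis=1, keepdims=True)
    # Directional derivatives of the loss along the d basis vectors.
    g = B @ loss_grad(theta)
    # Descend only within the current d-dimensional random subspace.
    theta -= eta * (g @ B)

final_loss = loss(theta)
```

Each step only requires the d directional derivatives g, which is what makes the scheme cheap to distribute: workers sharing a PRNG seed can regenerate B locally and exchange just the d coefficients rather than full D-dimensional gradients.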