Identity Matters in Deep Learning
Authors: Moritz Hardt, Tengyu Ma
ICLR 2017
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this work, we put the principle of identity parameterization on a more solid theoretical footing alongside further empirical progress. ... Directly inspired by our theory, we experiment with a radically simple residual architecture consisting of only residual convolutional layers and ReLU activations, but no batch normalization, dropout, or max pool. Our model improves significantly on previous all-convolutional networks on the CIFAR10, CIFAR100, and ImageNet classification benchmarks. *(See the residual-block sketch below this table.)* |
| Researcher Affiliation | Collaboration | Moritz Hardt, Google Brain, 1600 Amphitheatre Parkway, Mountain View, CA 94043, m@mrtz.org; Tengyu Ma, Department of Computer Science, Princeton University, 35 Olden Street, Princeton, 08540, tengyu@cs.princeton.edu |
| Pseudocode | No | No pseudocode or algorithm block found. The paper primarily uses mathematical equations and descriptive text. |
| Open Source Code | No | Our code can be easily derived from an open source implementation (https://github.com/tensorflow/models/tree/master/resnet) by removing batch normalization, adjusting the residual components and model architecture. |
| Open Datasets | Yes | Our model improves significantly on previous all-convolutional networks on the CIFAR10, CIFAR100, and ImageNet classification benchmarks. ... Inspired by our theory, we experimented with all-convolutional residual networks on standard image classification benchmarks. 4.1 CIFAR10 AND CIFAR100 ... The ImageNet ILSVRC 2012 data set has 1,281,167 data points with 1000 classes. |
| Dataset Splits | No | The ImageNet ILSVRC 2012 data set has 1,281,167 data points with 1000 classes. ... Our model still reached 35.29% top-1 classification error on the test set (50000 data points)... An interesting aspect of our model is that despite its massive size of 13.59 million trainable parameters, the model does not seem to overfit too quickly even though the data set size is 50000. In contrast, we found it difficult to train a model with batch normalization of this size without significant overfitting on CIFAR10. |
| Hardware Specification | Yes | Our model reaches peak performance at around 50k steps, which takes about 24h on a single NVIDIA Tesla K40 GPU. ... Training was distributed across 6 machines updating asynchronously. Each machine was equipped with 8 GPUs (NVIDIA Tesla K40) and used batch size 256 split across the 8 GPUs so that each GPU updated with batches of size 32. |
| Software Dependencies | No | We trained our models with the TensorFlow framework, using a momentum optimizer with momentum 0.9 and a batch size of 128. |
| Experiment Setup | Yes | We trained our models with the TensorFlow framework, using a momentum optimizer with momentum 0.9 and a batch size of 128. All convolutional weights are trained with weight decay 0.0001. The initial learning rate is 0.05, which drops by a factor of 10 at 30000 and 50000 steps. ... We trained the model with a momentum optimizer (with momentum 0.9) and a learning rate schedule that decays by a factor of 0.94 every two epochs, starting from the initial learning rate of 0.1. ... used batch size 256 split across the 8 GPUs so that each GPU updated with batches of size 32. *(See the training-schedule sketch below this table.)* |
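
The architecture quoted in the Research Type row above is an all-convolutional residual network built from residual convolutional layers and ReLU activations, with no batch normalization, dropout, or max pooling. The following is a minimal, hypothetical sketch of such a batch-norm-free residual block in TensorFlow/Keras; the filter count and kernel size are illustrative placeholders, not the authors' exact configuration.

```python
import tensorflow as tf

def residual_block(x, filters, kernel_size=3):
    """One batch-norm-free residual unit: two convolutions with a ReLU
    in between, added back onto the identity (skip) path.

    Assumes the input tensor already has `filters` channels so the
    element-wise addition is well defined.
    """
    shortcut = x
    y = tf.keras.layers.Conv2D(filters, kernel_size, padding="same")(x)
    y = tf.keras.layers.ReLU()(y)
    y = tf.keras.layers.Conv2D(filters, kernel_size, padding="same")(y)
    return tf.keras.layers.Add()([shortcut, y])  # identity skip, no batch norm
```

Stacking such blocks (with occasional strided convolutions to change resolution) gives an all-convolutional network in the spirit of the paper; the open source TensorFlow ResNet code referenced in the Open Source Code row would be the authoritative starting point.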
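
The Experiment Setup row quotes two learning-rate schedules: CIFAR training starting at 0.05 and dropping by a factor of 10 at 30000 and 50000 steps, and ImageNet training starting at 0.1 and decaying by 0.94 every two epochs. The sketch below shows how these schedules could be expressed with TensorFlow's built-in schedule classes; `steps_per_epoch` is an assumed placeholder derived from the quoted dataset and batch sizes, and the weight decay of 0.0001 would be attached to the convolutional weights separately (for example as an L2 kernel regularizer).

```python
import tensorflow as tf

# CIFAR: start at 0.05, divide by 10 at 30k steps and again at 50k steps.
cifar_lr = tf.keras.optimizers.schedules.PiecewiseConstantDecay(
    boundaries=[30000, 50000],
    values=[0.05, 0.005, 0.0005],
)
cifar_optimizer = tf.keras.optimizers.SGD(learning_rate=cifar_lr, momentum=0.9)

# ImageNet: start at 0.1, decay by 0.94 every two epochs (epoch length assumed
# from the quoted 1,281,167 images and global batch size 256).
steps_per_epoch = 1281167 // 256
imagenet_lr = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=0.1,
    decay_steps=2 * steps_per_epoch,
    decay_rate=0.94,
    staircase=True,
)
imagenet_optimizer = tf.keras.optimizers.SGD(learning_rate=imagenet_lr, momentum=0.9)
```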