Greedy Layerwise Learning Can Scale To ImageNet

Authors: Eugene Belilovsky, Michael Eickenberg, Edouard Oyallon

ICML 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "we study CNNs on image classification tasks using the large-scale ImageNet dataset and the CIFAR-10 dataset. Using a simple set of ideas for architecture and training we find that solving sequential 1-hidden-layer auxiliary problems lead to a CNN that exceeds AlexNet performance on ImageNet."
Researcher Affiliation | Academia | "(1) Mila, University of Montreal (2) University of California, Berkeley (3) CentraleSupelec, University of Paris-Saclay / INRIA Saclay."
Pseudocode | Yes | "Algorithm 1 Layerwise CNN" (a hedged sketch of this training loop appears after the table)
Open Source Code | No | No explicit statement about providing access to source code or a link to a code repository was found in the paper.
Open Datasets | Yes | "We performed experiments on the large-scale ImageNet-1k (Russakovsky et al., 2015), a major catalyst for the recent popularity of deep learning, as well as the CIFAR-10 dataset. CIFAR-10 consists of small RGB images with respectively 50k and 10k samples for training and testing."
Dataset Splits | Yes | "CIFAR-10 consists of small RGB images with respectively 50k and 10k samples for training and testing. ImageNet consists of 1.2M RGB images of varying size for training. Our final trained model achieves 79.7% top-5 single crop accuracy on the validation set."
Hardware Specification | No | Only general hardware information ("We use 4 GPUs to train our ImageNet models.") is provided, lacking specific GPU model numbers, processor types, or memory details.
Software Dependencies | No | The paper mentions optimization algorithms (SGD) and data augmentation techniques but does not name any software frameworks or version numbers needed for reproducibility.
Experiment Setup | Yes | CIFAR-10: "We use the standard data augmentation and optimize each layer with SGD using a momentum of 0.9 and a batch-size of 128. The initial learning rate is 0.1 and we use the reduced schedule with decays of 0.2 every 15 epochs (Zagoruyko & Komodakis, 2016), for a total of 50 epochs in each layer." ImageNet: "We used SGD with momentum 0.9, weight-decay of 10^-4 for a batch size of 256. The initial learning rate is 0.1 (He et al., 2016) and we use the reduced schedule with decays of 0.1 every 20 epochs for 45 epochs." (A hedged optimizer/schedule sketch for the ImageNet settings appears after the table.)
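
Since the paper provides pseudocode (Algorithm 1) but no released code, the following is a minimal sketch of greedy layerwise training with 1-hidden-layer auxiliary problems, written in PyTorch for illustration. The block structure, auxiliary-head design, stage widths, and function names are assumptions, not the authors' implementation; only the quoted CIFAR-10 optimizer settings (SGD, momentum 0.9, lr 0.1 decayed by 0.2 every 15 epochs, 50 epochs per layer) come from the table above.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F


    def make_block(in_ch, out_ch):
        # One convolutional stage trained greedily; the 3x3/BN/ReLU/pool layout
        # and the widths passed in are illustrative assumptions.
        return nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2),
        )


    def make_aux_head(channels, num_classes):
        # Auxiliary classifier for the current stage: pooled features plus a
        # linear layer, so the new stage + head forms a 1-hidden-layer problem.
        return nn.Sequential(
            nn.AdaptiveAvgPool2d(2),
            nn.Flatten(),
            nn.Linear(channels * 4, num_classes),
        )


    def train_layerwise(widths, train_loader, num_classes,
                        epochs_per_stage=50, device="cpu"):
        trained = []                               # frozen, already-trained stages
        in_ch = 3
        for out_ch in widths:
            block = make_block(in_ch, out_ch).to(device)
            head = make_aux_head(out_ch, num_classes).to(device)
            params = list(block.parameters()) + list(head.parameters())
            # CIFAR-10 settings quoted above; batch size 128 belongs in the DataLoader.
            opt = torch.optim.SGD(params, lr=0.1, momentum=0.9)
            sched = torch.optim.lr_scheduler.StepLR(opt, step_size=15, gamma=0.2)
            for _ in range(epochs_per_stage):
                for x, y in train_loader:
                    x, y = x.to(device), y.to(device)
                    with torch.no_grad():          # earlier stages receive no gradient
                        for b in trained:
                            x = b(x)
                    loss = F.cross_entropy(head(block(x)), y)
                    opt.zero_grad()
                    loss.backward()
                    opt.step()
                sched.step()
            block.eval()
            for p in block.parameters():           # freeze the finished stage
                p.requires_grad_(False)
            trained.append(block)
            in_ch = out_ch
        return nn.Sequential(*trained)             # stacked feature extractor

The auxiliary heads are discarded between stages; only the frozen convolutional stages are kept and reused as the fixed input to the next 1-hidden-layer problem.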
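
For the ImageNet rows of the setup, the per-stage optimizer and schedule quoted above can be reproduced as follows. This is a sketch under the stated hyperparameters only (SGD, momentum 0.9, weight decay 1e-4, lr 0.1 decayed by 0.1 every 20 epochs, 45 epochs per layer); the helper name and the example module are ours, and the batch size of 256 and 4-GPU setup are data-loading/parallelism choices not shown here.

    import torch
    import torch.nn as nn


    def make_imagenet_stage_optimizer(stage: nn.Module):
        # Hyperparameters taken verbatim from the quoted ImageNet setup.
        optimizer = torch.optim.SGD(stage.parameters(), lr=0.1,
                                    momentum=0.9, weight_decay=1e-4)
        # lr decayed by 0.1 every 20 epochs; call scheduler.step() once per epoch
        # for the 45 epochs of each layerwise stage.
        scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.1)
        return optimizer, scheduler


    # Usage with a placeholder stage module:
    opt, sched = make_imagenet_stage_optimizer(nn.Conv2d(3, 64, 3, padding=1))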