Decoupled Greedy Learning of CNNs

Authors: Eugene Belilovsky, Michael Eickenberg, Edouard Oyallon

ICML 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct experiments that empirically show that DGL optimizes the greedy objective well, showing it is favorable against recent state-of-the-art proposals for decoupling training of deep network modules. We show that unlike previous decoupled proposals it can still work on a large-scale dataset (ImageNet) and that it can, in some cases, generalize better than standard back-propagation. We then extensively evaluate the asynchronous DGL, simulating large delays.
Researcher Affiliation | Collaboration | MILA; Center for Computational Mathematics, Flatiron Institute; CNRS, LIP6.
Pseudocode | Yes | Algorithm 1: Synchronous DGL; Algorithm 2: Asynchronous DGL with Replay (a minimal sketch of the synchronous update follows the table).
Open Source Code | Yes | Code for experiments is included in the submission.
Open Datasets | Yes | We demonstrate the effectiveness of DGL against alternative approaches on the CIFAR-10 dataset and on the large-scale ImageNet dataset. (Krizhevsky, 2009)
Dataset Splits | No | The paper uses the CIFAR-10 and ImageNet datasets but does not explicitly specify the proportions or counts of the training, validation, and test splits, nor does it point to detailed predefined standard splits.
Hardware Specification | No | The paper mentions a 'single 16GB GPU' but does not specify the hardware models or manufacturers used in the experiments (e.g., NVIDIA A100, Intel Xeon).
Software Dependencies | No | The paper does not provide specific version numbers for software dependencies or libraries (e.g., 'Python 3.8, PyTorch 1.9'). It mentions optimizers like Adam and SGD but without software versions.
Experiment Setup | Yes | We reproduce the CIFAR-10 CNN experiment described in (Jaderberg et al., 2017), Appendix C.1. This experiment utilizes a 3-layer network with auxiliary networks of 2 hidden CNN layers... using Adam with a learning rate of 3 × 10⁻⁵. We run training for 1500 epochs... For this experiment we use a buffer of size M = 50. We run separate experiments with the slowdown applied at each layer of the network as well as 3 random seeds for each of these settings (thus 18 experiments per data point). We show the evaluations for 10 values of S.
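
As a rough illustration of the pseudocode named above, the following is a minimal sketch of the synchronous DGL update in PyTorch-style Python: each module is paired with an auxiliary classifier, trains only on its local loss, and passes a detached activation to the next module. The block widths, auxiliary-head design, and learning rate are illustrative assumptions rather than the paper's exact configuration.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class LocalBlock(nn.Module):
        """A convolutional module paired with an auxiliary classifier."""
        def __init__(self, in_ch, out_ch, num_classes=10):
            super().__init__()
            self.body = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 3, padding=1),
                nn.BatchNorm2d(out_ch),
                nn.ReLU(),
            )
            # The auxiliary head supplies the local (greedy) training signal.
            self.aux = nn.Sequential(
                nn.AdaptiveAvgPool2d(1),
                nn.Flatten(),
                nn.Linear(out_ch, num_classes),
            )

        def forward(self, x):
            h = self.body(x)
            return h, self.aux(h)

    # Three decoupled blocks, each with its own optimizer (widths are illustrative).
    blocks = [LocalBlock(3, 64), LocalBlock(64, 128), LocalBlock(128, 256)]
    opts = [torch.optim.Adam(b.parameters(), lr=3e-5) for b in blocks]

    def dgl_step(x, y):
        """One synchronous DGL update: every block trains on its own auxiliary
        loss, and only a detached activation is forwarded, so no gradient
        crosses module boundaries."""
        for block, opt in zip(blocks, opts):
            h, logits = block(x)
            loss = F.cross_entropy(logits, y)
            opt.zero_grad()
            loss.backward()      # gradients stay inside this block
            opt.step()
            x = h.detach()       # decouple: earlier blocks receive no gradient

The asynchronous variant (Algorithm 2) runs the same per-block update, but each block consumes whatever input is currently available from its upstream module, which is where the replay buffer sketched next comes in.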
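
For the asynchronous-with-replay experiments quoted in the Experiment Setup row, a bounded buffer of size M = 50 lets a module keep training while an artificially slowed upstream module lags behind. Only the buffer size follows the quoted setup; the queue interface and uniform sampling policy below are illustrative assumptions.

    import random
    from collections import deque
    from queue import Empty, Queue

    M = 50                    # buffer size from the quoted experiment setup
    buffer = deque(maxlen=M)  # recent (activation, label) batches from upstream

    def next_batch(upstream: Queue):
        """Return a fresh batch if the upstream module has produced one;
        otherwise replay a stored batch so this module is never idle."""
        try:
            batch = upstream.get_nowait()
            buffer.append(batch)
            return batch
        except Empty:
            return random.choice(list(buffer)) if buffer else None

Under this scheme, the slowdown factor S swept in the quoted experiments would govern how often the upstream queue is empty and a replayed batch is used instead.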