SEDONA: Search for Decoupled Neural Networks toward Greedy Block-wise Learning

Authors: Myeongjang Pyeon, Jihwan Moon, Taeyoung Hahn, Gunhee Kim

ICLR 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results show that our algorithm can consistently discover transferable decoupled architectures for VGG and ResNet variants, and significantly outperforms the ones trained with end-to-end backpropagation and other state-of-the-art greedy-learning methods in CIFAR-10, Tiny-ImageNet and ImageNet. We experiment with the proposed SEDONA in two stages: search and evaluation. In the search stage, SEDONA searches for the best decoupling configuration for a given neural network on CIFAR-10 to minimize the validation loss. In the evaluation stage, we split the networks according to the searched configuration and evaluate their greedy block-wise learning performance for classification on CIFAR-10 (Krizhevsky & Hinton, 2009), Tiny-ImageNet and ImageNet (Russakovsky et al., 2015). (Greedy block-wise training is sketched below the table.)
Researcher Affiliation | Academia | Myeongjang Pyeon, Jihwan Moon, Taeyoung Hahn, and Gunhee Kim, Seoul National University, Seoul, Korea
Pseudocode | Yes | Algorithm 1: SEDONA Searching for Decoupled Neural Architectures
Open Source Code | No | For PredSim, DGL and Features Replay implementations, we refer to their official PyTorch implementations. PredSim: https://github.com/anokland/local-loss, DGL: https://github.com/eugenium/DGL, Features Replay: https://github.com/slowbull/FeaturesReplay. The paper does not state that the code for SEDONA itself is open source or provide a link to it.
Open Datasets | Yes | We evaluate their greedy block-wise learning performance for classification in CIFAR-10 (Krizhevsky & Hinton, 2009), Tiny-ImageNet (http://tiny-imagenet.herokuapp.com/) and ImageNet (Russakovsky et al., 2015).
Dataset Splits | Yes | We use 40% of the CIFAR-10 training split as a validation set. 10% of the train data is used as the validation set. (A split sketch follows the table.)
Hardware Specification | Yes | All experiments are conducted with a total of 8 NVIDIA Quadro 6000 GPU cards and 2 8-core Intel Xeon E5-2620 v4 processors with 256 GB RAM.
Software Dependencies | Yes | For implementation, we use Python 3.8 and PyTorch 1.6.0. At the search stage, we use the higher library to enable differentiable weight updates in PyTorch computational graphs. For evaluation, we implement asynchronous updates of blocks by introducing queues between blocks. For PredSim, DGL and Features Replay implementations, we refer to their official PyTorch implementations. We use mixed precision training with Apex on Tiny-ImageNet and ImageNet. (A higher usage sketch follows the table.)
Experiment Setup | Yes | We use the Adam optimizer (Kingma & Ba, 2015) with a fixed learning rate of 0.01 and a weight decay of 0.000001. For the inner optimization, we use SGD with a momentum of 0.9 and a weight decay of 0.001. We use an initial learning rate of 0.1 and decay it down to 0.001 with cosine annealing learning rate decay (Loshchilov & Hutter, 2017). Label smoothing (Szegedy et al., 2016) of 0.1 is also used. We repeat the bilevel optimization steps for 2K iterations. As mentioned in Section 3.3, we pretrain weights for 40K iterations with the outer variables fixed as zero and store 50 sets of weights with the best validation accuracies. (An optimizer-setup sketch follows the table.)
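
The Research Type row above summarizes SEDONA's two-stage protocol: searching for a decoupling configuration and then training the resulting blocks greedily. The following is a minimal sketch of greedy block-wise learning, with hypothetical block and auxiliary-head definitions; it is not the paper's implementation (which uses the searched split points, the paper's auxiliary losses, and asynchronous queues between blocks), only an illustration of local, detached updates.

```python
import torch
import torch.nn as nn

# Hypothetical blocks and auxiliary classifier heads (shapes chosen for CIFAR-10-like input).
blocks = nn.ModuleList([
    nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2)),
    nn.Sequential(nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2)),
])
aux_heads = nn.ModuleList([
    nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, 10)),
    nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, 10)),
])
# One optimizer per block, so each block is updated from its own local loss only.
optims = [torch.optim.SGD(list(b.parameters()) + list(h.parameters()),
                          lr=0.1, momentum=0.9, weight_decay=1e-3)
          for b, h in zip(blocks, aux_heads)]
criterion = nn.CrossEntropyLoss()

x, y = torch.randn(8, 3, 32, 32), torch.randint(0, 10, (8,))
for block, head, opt in zip(blocks, aux_heads, optims):
    feat = block(x)                  # forward through this block only
    loss = criterion(head(feat), y)  # local auxiliary loss
    opt.zero_grad()
    loss.backward()                  # gradients stay within this block
    opt.step()
    x = feat.detach()                # stop gradients before the next block
```

The design point this sketch illustrates is the detach between blocks: no gradient crosses a block boundary, so there is no end-to-end backpropagation, and the block boundaries themselves are what SEDONA's search stage chooses.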
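
The Dataset Splits row quotes a 40% validation split of the CIFAR-10 training data for the search stage. A minimal sketch of such a split with torchvision; the seed, paths, and variable names are assumptions, not taken from the paper.

```python
import torch
from torch.utils.data import random_split
from torchvision import datasets, transforms

# Hold out 40% of the CIFAR-10 training set as a validation set (hypothetical seed).
train_full = datasets.CIFAR10(root="./data", train=True, download=True,
                              transform=transforms.ToTensor())
n_val = int(0.4 * len(train_full))       # 20,000 images
n_train = len(train_full) - n_val        # 30,000 images
train_set, val_set = random_split(
    train_full, [n_train, n_val],
    generator=torch.Generator().manual_seed(0))
```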
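
The Software Dependencies row mentions the higher library for differentiable weight updates in PyTorch graphs. A minimal sketch of that pattern; the model, data, and losses are placeholders, not SEDONA's actual search objective.

```python
import torch
import torch.nn as nn
import higher

model = nn.Linear(10, 2)
inner_opt = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
x, y = torch.randn(4, 10), torch.randint(0, 2, (4,))

# innerloop_ctx wraps the model and optimizer so that parameter updates
# remain part of the computational graph.
with higher.innerloop_ctx(model, inner_opt) as (fmodel, diffopt):
    inner_loss = nn.functional.cross_entropy(fmodel(x), y)
    diffopt.step(inner_loss)  # differentiable weight update
    val_loss = nn.functional.cross_entropy(fmodel(x), y)
    # val_loss is now differentiable through the inner update; in the paper's
    # bilevel setup, gradients would flow back to the outer decoupling
    # variables, which are not modeled in this sketch.
```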
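
The Experiment Setup row lists the outer Adam and inner SGD settings. A hedged sketch of those hyperparameters in PyTorch; the outer variables and model are hypothetical stand-ins, and mapping the scheduler's T_max to the 2K bilevel iterations is an assumption.

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)                              # placeholder network
outer_params = [torch.zeros(5, requires_grad=True)]   # assumed stand-in for decoupling variables

# Outer optimization: Adam, fixed lr 0.01, weight decay 1e-6 (as quoted).
outer_opt = torch.optim.Adam(outer_params, lr=0.01, weight_decay=1e-6)
# Inner optimization: SGD with momentum 0.9 and weight decay 0.001.
inner_opt = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=1e-3)
# Cosine annealing from 0.1 down to 0.001; T_max=2000 assumes one step per bilevel iteration.
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    inner_opt, T_max=2000, eta_min=0.001)
# Label smoothing of 0.1. Note: CrossEntropyLoss gained this argument in PyTorch 1.10;
# with PyTorch 1.6.0 as quoted, a custom smoothed loss would be needed instead.
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)
```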