AlphaNet: Improved Training of Supernets with Alpha-Divergence

Authors: Dilin Wang, Chengyue Gong, Meng Li, Qiang Liu, Vikas Chandra

ICML 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We apply the proposed α-divergence based supernets training to both slimmable neural networks and weight-sharing NAS, and demonstrate significant improvements. Specifically, our discovered model family, AlphaNet, outperforms prior-art models on a wide range of FLOPs regimes, including BigNAS, Once-for-All networks, and AttentiveNAS. We achieve ImageNet top-1 accuracy of 80.0% with only 444M FLOPs." and, from Section 4 (Experiments): "We apply our Adaptive-KD to improve notable supernet-based applications, including slimmable neural networks (Yu & Huang, 2019) and weight-sharing NAS (e.g., Cai et al., 2019a; Yu et al., 2020; Wang et al., 2020a). We provide an overview of our algorithm for training the supernet in Algorithm 1." (see the adaptive-KD loss sketch below the table)
Researcher Affiliation | Collaboration | "¹Facebook, ²Department of Computer Science, The University of Texas at Austin. Correspondence to: Dilin Wang <wdilin@fb.com>, Chengyue Gong <cygong@cs.utexas.edu>, Meng Li <meng.li@fb.com>, Qiang Liu <lqiang@cs.utexas.edu>, Vikas Chandra <vchandra@fb.com>."
Pseudocode | Yes | "Algorithm 1: Training supernets with α-divergence" (see the training-step sketch below the table)
Open Source Code | Yes | "Our code and pretrained models are available at https://github.com/facebookresearch/AlphaNet."
Open Datasets | Yes | "We evaluate on the ImageNet dataset (Deng et al., 2009)."
Dataset Splits | Yes | "To estimate the performance Pareto, we proceed as follows: 1) we first randomly sample 512 sub-networks from the supernet and estimate their accuracy on the ImageNet validation set;" (see the sub-network evaluation sketch below the table)
Hardware Specification | No | "We train all models for 360 epochs using SGD optimizer... and batch size of 2048 on 16 GPUs." No specific GPU model or other hardware details are provided.
Software Dependencies | No | No specific software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow, CUDA) are explicitly mentioned in the paper.
Experiment Setup | Yes | "Additionally, we train all models for 360 epochs using SGD optimizer with momentum as 0.9, weight decay as 10^-5 and dropout as 0.2. We use cosine learning rate decay, with an initial learning rate of 0.8, and batch size of 2048 on 16 GPUs." (see the optimizer configuration sketch below the table)
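The "Research Type" and "Pseudocode" rows refer to the paper's α-divergence based adaptive KD objective. Below is a minimal PyTorch sketch of a clipped α-divergence distillation loss, not the authors' released implementation: the exact divergence parameterization, the clipping threshold, and the example (α_-, α_+) values are assumptions made for illustration.

```python
# Hedged sketch of an adaptive, clipped alpha-divergence KD loss.
# Assumed parameterization: D_alpha(p || q) = (E_{y~q}[(p(y)/q(y))^alpha] - 1) / (alpha * (alpha - 1)),
# with alpha -> 1 recovering the usual KL(p || q); alpha must avoid {0, 1} here.
import torch


def clipped_alpha_divergence(p, q, alpha, clip=5.0, eps=1e-8):
    """p, q: (batch, num_classes) probability vectors; alpha not in {0, 1}."""
    ratio = ((p + eps) / (q + eps)).pow(alpha)
    ratio = ratio.clamp(max=clip)            # bound the importance weights
    div = (q * ratio).sum(dim=-1) - 1.0
    return div / (alpha * (alpha - 1.0))


def adaptive_kd_loss(teacher_logits, student_logits,
                     alpha_minus=-1.0, alpha_plus=0.5, clip=5.0):
    """Take the larger of two alpha-divergences so the student is penalized for
    both over- and under-estimating the teacher (placeholder alpha values)."""
    p = teacher_logits.softmax(dim=-1).detach()
    q = student_logits.softmax(dim=-1)
    d_minus = clipped_alpha_divergence(p, q, alpha_minus, clip)
    d_plus = clipped_alpha_divergence(p, q, alpha_plus, clip)
    return torch.maximum(d_minus, d_plus).mean()
```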
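Algorithm 1 trains the supernet by distilling sampled sub-networks against the largest sub-network. The sketch below shows one plausible training step in that style; `sample_max_subnet`, `sample_min_subnet`, and `sample_random_subnet` are hypothetical helper names rather than the released API, and `adaptive_kd_loss` is the function from the previous sketch.

```python
# Hedged sketch of one supernet training step (sandwich-style sampling + KD).
import torch.nn.functional as F


def train_step(supernet, images, labels, optimizer, num_random=2):
    optimizer.zero_grad()

    # 1) Largest sub-network: supervised cross-entropy on the labels.
    supernet.sample_max_subnet()              # assumed helper
    teacher_logits = supernet(images)
    loss = F.cross_entropy(teacher_logits, labels)
    loss.backward()

    # 2) Smallest plus a few random sub-networks: distill from the largest one.
    samplers = [supernet.sample_min_subnet] + \
               [supernet.sample_random_subnet] * num_random
    for sampler in samplers:
        sampler()                              # assumed helpers
        student_logits = supernet(images)
        kd = adaptive_kd_loss(teacher_logits.detach(), student_logits)
        kd.backward()                          # gradients accumulate across sub-networks

    optimizer.step()
```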
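The "Experiment Setup" row quotes concrete hyper-parameters. The snippet below maps them onto standard PyTorch objects as a sketch; the placeholder `model` stands in for the supernet, and details the quote does not specify (warmup, per-batch vs. per-epoch scheduler stepping) are left out.

```python
# Hedged sketch of the quoted optimizer / schedule settings.
import torch

model = torch.nn.Linear(10, 10)   # placeholder; the paper trains a weight-sharing supernet

optimizer = torch.optim.SGD(
    model.parameters(),
    lr=0.8,                       # initial learning rate from the quoted setup
    momentum=0.9,
    weight_decay=1e-5,
)
# Cosine decay over the quoted 360 training epochs (per-epoch stepping assumed).
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=360)
```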
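The "Dataset Splits" row quotes step 1 of the Pareto-estimation procedure: randomly sample 512 sub-networks and measure their accuracy on the ImageNet validation set. A hedged sketch of that step follows; `sample_random_subnet` is again an assumed helper, and only the quoted sampling-and-evaluation step is shown.

```python
# Hedged sketch: evaluate randomly sampled sub-networks on a validation loader.
import torch


@torch.no_grad()
def evaluate_sampled_subnets(supernet, val_loader, num_subnets=512, device="cuda"):
    """Return (subnet_config, top-1 accuracy) pairs for randomly sampled sub-networks."""
    supernet.eval()
    points = []
    for _ in range(num_subnets):
        cfg = supernet.sample_random_subnet()   # assumed helper
        correct = total = 0
        for images, labels in val_loader:
            images, labels = images.to(device), labels.to(device)
            preds = supernet(images).argmax(dim=-1)
            correct += (preds == labels).sum().item()
            total += labels.numel()
        points.append((cfg, correct / total))
    return points
```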