AlphaNet: Improved Training of Supernets with Alpha-Divergence

Authors: Dilin Wang, Chengyue Gong, Meng Li, Qiang Liu, Vikas Chandra

ICML 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We apply the proposed α-divergence based supernets training to both slimmable neural networks and weight-sharing NAS, and demonstrate significant improvements. Specifically, our discovered model family, AlphaNet, outperforms prior-art models on a wide range of FLOPs regimes, including BigNAS, Once-for-All networks, and AttentiveNAS. We achieve ImageNet top-1 accuracy of 80.0% with only 444M FLOPs." and, from Section 4 (Experiments): "We apply our Adaptive-KD to improve notable supernet-based applications, including slimmable neural networks (Yu & Huang, 2019) and weight-sharing NAS (e.g., Cai et al., 2019a; Yu et al., 2020; Wang et al., 2020a). We provide an overview of our algorithm for training the supernet in Algorithm 1." (see the adaptive-KD loss sketch below the table)
Researcher Affiliation | Collaboration | "¹Facebook, ²Department of Computer Science, The University of Texas at Austin. Correspondence to: Dilin Wang <wdilin@fb.com>, Chengyue Gong <cygong@cs.utexas.edu>, Meng Li <meng.li@fb.com>, Qiang Liu <lqiang@cs.utexas.edu>, Vikas Chandra <vchandra@fb.com>."
Pseudocode | Yes | "Algorithm 1: Training supernets with α-divergence" (see the training-step sketch below the table)
Open Source Code | Yes | "Our code and pretrained models are available at https://github.com/facebookresearch/AlphaNet."
Open Datasets | Yes | "We evaluate on the ImageNet dataset (Deng et al., 2009)."
Dataset Splits | Yes | "To estimate the performance Pareto, we proceed as follows: 1) we first randomly sample 512 sub-networks from the supernet and estimate their accuracy on the ImageNet validation set;" (see the sub-network evaluation sketch below the table)
Hardware Specification | No | "We train all models for 360 epochs using SGD optimizer... and batch size of 2048 on 16 GPUs." No specific GPU model or other hardware details are provided.
Software Dependencies | No | No specific software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow, CUDA) are explicitly mentioned in the paper.
Experiment Setup | Yes | "Additionally, we train all models for 360 epochs using SGD optimizer with momentum as 0.9, weight decay as 10^-5 and dropout as 0.2. We use cosine learning rate decay, with an initial learning rate of 0.8, and batch size of 2048 on 16 GPUs." (see the optimizer configuration sketch below the table)
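The "Research Type" and "Pseudocode" rows refer to the paper's α-divergence based adaptive KD objective. Below is a minimal PyTorch sketch of a clipped α-divergence distillation loss, not the authors' released implementation: the exact divergence parameterization, the clipping threshold, and the example (α_-, α_+) values are assumptions made for illustration.

```python
# Hedged sketch of an adaptive, clipped alpha-divergence KD loss.
# Assumed parameterization: D_alpha(p || q) = (E_{y~q}[(p(y)/q(y))^alpha] - 1) / (alpha * (alpha - 1)),
# with alpha -> 1 recovering the usual KL(p || q); alpha must avoid {0, 1} here.
import torch


def clipped_alpha_divergence(p, q, alpha, clip=5.0, eps=1e-8):
    """p, q: (batch, num_classes) probability vectors; alpha not in {0, 1}."""
    ratio = ((p + eps) / (q + eps)).pow(alpha)
    ratio = ratio.clamp(max=clip)            # bound the importance weights
    div = (q * ratio).sum(dim=-1) - 1.0
    return div / (alpha * (alpha - 1.0))


def adaptive_kd_loss(teacher_logits, student_logits,
                     alpha_minus=-1.0, alpha_plus=0.5, clip=5.0):
    """Take the larger of two alpha-divergences so the student is penalized for
    both over- and under-estimating the teacher (placeholder alpha values)."""
    p = teacher_logits.softmax(dim=-1).detach()
    q = student_logits.softmax(dim=-1)
    d_minus = clipped_alpha_divergence(p, q, alpha_minus, clip)
    d_plus = clipped_alpha_divergence(p, q, alpha_plus, clip)
    return torch.maximum(d_minus, d_plus).mean()
```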
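Algorithm 1 trains the supernet by distilling sampled sub-networks against the largest sub-network. The sketch below shows one plausible training step in that style; `sample_max_subnet`, `sample_min_subnet`, and `sample_random_subnet` are hypothetical helper names rather than the released API, and `adaptive_kd_loss` is the function from the previous sketch.

```python
# Hedged sketch of one supernet training step (sandwich-style sampling + KD).
import torch.nn.functional as F


def train_step(supernet, images, labels, optimizer, num_random=2):
    optimizer.zero_grad()

    # 1) Largest sub-network: supervised cross-entropy on the labels.
    supernet.sample_max_subnet()              # assumed helper
    teacher_logits = supernet(images)
    loss = F.cross_entropy(teacher_logits, labels)
    loss.backward()

    # 2) Smallest plus a few random sub-networks: distill from the largest one.
    samplers = [supernet.sample_min_subnet] + \
               [supernet.sample_random_subnet] * num_random
    for sampler in samplers:
        sampler()                              # assumed helpers
        student_logits = supernet(images)
        kd = adaptive_kd_loss(teacher_logits.detach(), student_logits)
        kd.backward()                          # gradients accumulate across sub-networks

    optimizer.step()
```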
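The "Experiment Setup" row quotes concrete hyper-parameters. The snippet below maps them onto standard PyTorch objects as a sketch; the placeholder `model` stands in for the supernet, and details the quote does not specify (warmup, per-batch vs. per-epoch scheduler stepping) are left out.

```python
# Hedged sketch of the quoted optimizer / schedule settings.
import torch

model = torch.nn.Linear(10, 10)   # placeholder; the paper trains a weight-sharing supernet

optimizer = torch.optim.SGD(
    model.parameters(),
    lr=0.8,                       # initial learning rate from the quoted setup
    momentum=0.9,
    weight_decay=1e-5,
)
# Cosine decay over the quoted 360 training epochs (per-epoch stepping assumed).
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=360)
```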
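The "Dataset Splits" row quotes step 1 of the Pareto-estimation procedure: randomly sample 512 sub-networks and measure their accuracy on the ImageNet validation set. A hedged sketch of that step follows; `sample_random_subnet` is again an assumed helper, and only the quoted sampling-and-evaluation step is shown.

```python
# Hedged sketch: evaluate randomly sampled sub-networks on a validation loader.
import torch


@torch.no_grad()
def evaluate_sampled_subnets(supernet, val_loader, num_subnets=512, device="cuda"):
    """Return (subnet_config, top-1 accuracy) pairs for randomly sampled sub-networks."""
    supernet.eval()
    points = []
    for _ in range(num_subnets):
        cfg = supernet.sample_random_subnet()   # assumed helper
        correct = total = 0
        for images, labels in val_loader:
            images, labels = images.to(device), labels.to(device)
            preds = supernet(images).argmax(dim=-1)
            correct += (preds == labels).sum().item()
            total += labels.numel()
        points.append((cfg, correct / total))
    return points
```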