AlphaNet: Improved Training of Supernets with Alpha-Divergence
Authors: Dilin Wang, Chengyue Gong, Meng Li, Qiang Liu, Vikas Chandra
ICML 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We apply the proposed α-divergence based supernets training to both slimmable neural networks and weight-sharing NAS, and demonstrate significant improvements. Specifically, our discovered model family, AlphaNet, outperforms prior-art models on a wide range of FLOPs regimes, including BigNAS, Once-for-All networks, and AttentiveNAS. We achieve ImageNet top-1 accuracy of 80.0% with only 444M FLOPs. From Section 4 (Experiments): We apply our Adaptive-KD to improve notable supernet-based applications, including slimmable neural networks (Yu & Huang, 2019) and weight-sharing NAS (e.g., Cai et al., 2019a; Yu et al., 2020; Wang et al., 2020a). We provide an overview of our algorithm for training the supernet in Algorithm 1. (See the α-divergence loss sketch after the table.) |
| Researcher Affiliation | Collaboration | ¹Facebook, ²Department of Computer Science, The University of Texas at Austin. Correspondence to: Dilin Wang <wdilin@fb.com>, Chengyue Gong <cygong@cs.utexas.edu>, Meng Li <meng.li@fb.com>, Qiang Liu <lqiang@cs.utexas.edu>, Vikas Chandra <vchandra@fb.com>. |
| Pseudocode | Yes | Algorithm 1: Training supernets with α-divergence |
| Open Source Code | Yes | Our code and pretrained models are available at https://github.com/facebookresearch/AlphaNet. |
| Open Datasets | Yes | We evaluate on the ImageNet dataset (Deng et al., 2009). |
| Dataset Splits | Yes | To estimate the performance Pareto, we proceed as follows: 1) we first randomly sample 512 sub-networks from the supernet and estimate their accuracy on the ImageNet validation set; (See the Pareto-sampling sketch after the table.) |
| Hardware Specification | No | We train all models for 360 epochs using SGD optimizer... and batch size of 2048 on 16 GPUs. No specific GPU model or other hardware specifications were provided. |
| Software Dependencies | No | No specific software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow, CUDA) are explicitly mentioned in the paper. |
| Experiment Setup | Yes | Additionally, we train all models for 360 epochs using SGD optimizer with momentum as 0.9, weight decay as 10⁻⁵ and dropout as 0.2. We use cosine learning rate decay, with an initial learning rate of 0.8, and batch size of 2048 on 16 GPUs. (See the training-configuration sketch after the table.) |
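
The "Research Type" and "Pseudocode" rows quote the paper's α-divergence based supernet training (Algorithm 1). As a rough, non-authoritative illustration, the sketch below implements a generic adaptive α-divergence KD loss between teacher and student logits, assuming the standard definition D_α(p‖q) = (Σᵢ pᵢ^α qᵢ^(1−α) − 1) / (α(α−1)). The α± defaults are illustrative, and the paper's importance-weight clipping and the full Algorithm 1 training loop are not reproduced here.

```python
import torch
import torch.nn.functional as F

def alpha_divergence(p, q, alpha, eps=1e-8):
    # Generic alpha-divergence: D_alpha(p || q) = (sum_i p_i^alpha * q_i^(1-alpha) - 1) / (alpha * (alpha - 1)).
    # Valid for alpha outside {0, 1}; those limits recover the two KL directions.
    ratio = (p + eps).pow(alpha) * (q + eps).pow(1.0 - alpha)
    return (ratio.sum(dim=-1) - 1.0) / (alpha * (alpha - 1.0))

def adaptive_alpha_kd_loss(teacher_logits, student_logits, alpha_neg=-1.0, alpha_pos=0.5):
    # Illustrative alpha_neg / alpha_pos defaults; the paper treats them as hyperparameters.
    p = F.softmax(teacher_logits.detach(), dim=-1)  # teacher distribution (not trained)
    q = F.softmax(student_logits, dim=-1)           # student (sub-network) distribution
    d_neg = alpha_divergence(p, q, alpha_neg)
    d_pos = alpha_divergence(p, q, alpha_pos)
    # Take the larger of the two divergences per sample, penalizing the student
    # for both over- and under-estimating the teacher's probabilities.
    return torch.maximum(d_neg, d_pos).mean()
```

Taking the more pessimistic of a negative-α and a positive-α term mirrors the intuition the paper gives for its adaptive objective; the paper's exact clipped estimator differs from this simplification.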
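
The "Dataset Splits" row describes estimating the performance Pareto by sampling 512 sub-networks and scoring each on the ImageNet validation set. A minimal sketch of that sampling loop, under stated assumptions, follows; the three callables (`sample_subnet`, `evaluate_top1`, `count_flops`) are hypothetical stand-ins, not APIs from the AlphaNet repository.

```python
def estimate_pareto(sample_subnet, evaluate_top1, count_flops, num_samples=512):
    """Sketch of the quoted Pareto-estimation step: draw `num_samples` sub-networks
    from the trained supernet and record (FLOPs, top-1 accuracy) pairs, from which
    the accuracy-vs-FLOPs frontier can be read off."""
    points = []
    for _ in range(num_samples):
        subnet = sample_subnet()          # hypothetical: draw a random sub-network
        top1 = evaluate_top1(subnet)      # hypothetical: ImageNet validation top-1
        points.append((count_flops(subnet), top1))
    return points
```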
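
The "Experiment Setup" row lists the reported hyperparameters. The following sketch assembles them into a plausible PyTorch optimizer and scheduler configuration, assuming a standard cosine schedule stepped once per epoch; `supernet`, the data pipeline, and the 16-GPU distributed setup are placeholders, not the authors' training script.

```python
import torch

supernet = torch.nn.Linear(8, 8)  # placeholder for the actual supernet

optimizer = torch.optim.SGD(
    supernet.parameters(),
    lr=0.8,             # reported initial learning rate (global batch size 2048)
    momentum=0.9,       # reported momentum
    weight_decay=1e-5,  # reported weight decay
)
# Cosine learning-rate decay over the reported 360 training epochs.
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=360)

for epoch in range(360):
    # ... one epoch of supernet training (dropout 0.2 is applied inside the model) ...
    scheduler.step()
```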