Path-Level Network Transformation for Efficient Architecture Search
Authors: Han Cai, Jiacheng Yang, Weinan Zhang, Song Han, Yong Yu
ICML 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We experimented on the image classification datasets with limited computational resources (about 200 GPU-hours), where we observed improved parameter efficiency and better test results (97.70% test accuracy on CIFAR-10 with 14.3M parameters and 74.6% top-1 accuracy on ImageNet in the mobile setting), demonstrating the effectiveness and transferability of our designed architectures. |
| Researcher Affiliation | Academia | ¹Shanghai Jiao Tong University, Shanghai, China; ²Massachusetts Institute of Technology, Cambridge, USA. Correspondence to: Han Cai <hcai@apex.sjtu.edu.cn>. |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Experiment code: https://github.com/han-cai/Path-Level-EAS |
| Open Datasets | Yes | CIFAR-10 (Krizhevsky & Hinton, 2009) for the image classification task and transfer the learned cell structures to ImageNet dataset (Deng et al., 2009). |
| Dataset Splits | Yes | CIFAR-10 contains 50,000 training images and 10,000 test images, where we randomly sample 5,000 images from the training set to form a validation set for the architecture search process, similar to previous work (Zoph et al., 2017; Cai et al., 2018). (A minimal sketch of this split appears after the table.) |
| Hardware Specification | No | The paper mentions 'about 200 GPU-hours' for its experiments, but does not specify the exact GPU models, CPU models, or other detailed hardware specifications used. |
| Software Dependencies | No | The paper does not provide specific software dependencies with version numbers. |
| Experiment Setup | Yes | For the meta-controller, described in Section 3.3, the hidden state size of all LSTM units is 100 and we train it with the Adam optimizer (Kingma & Ba, 2014) using the REINFORCE algorithm (Williams, 1992). To reduce variance, we adopt a baseline function which is an exponential moving average of previous rewards with a decay of 0.95, as done in Cai et al. (2018). We also use an entropy penalty with a weight of 0.01 to ensure exploration. ... The obtained network ... is then trained for 20 epochs on CIFAR-10 with an initial learning rate of 0.035 that is further annealed with a cosine learning rate decay (Loshchilov & Hutter, 2016), a batch size of 64, a weight decay of 0.0001, using the SGD optimizer with a Nesterov momentum of 0.9. ... Additionally, we update the meta-controller with mini-batches of 10 architectures. ... In this stage, we train networks for 300 epochs with an initial learning rate of 0.1, while all other settings remain the same. ... We set the maximum depth of the cell structures to be 3 ... For nodes whose merge scheme is add, the number of branches is chosen from {2, 3}, while for nodes whose merge scheme is concatenation, the number of branches is set to be 2. (Sketches of the reward baseline and the child-network training settings appear after the table.) |
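
The Dataset Splits row above reports that 5,000 of CIFAR-10's 50,000 training images are randomly sampled as a validation set for the architecture search. The snippet below is a minimal sketch of such a split; the function name, seed handling, and use of NumPy are illustrative assumptions, not details taken from the authors' repository.

```python
# Hedged sketch of the CIFAR-10 split described in the Dataset Splits row:
# 50,000 training images, of which 5,000 are randomly sampled for validation.
# All names here are illustrative; the Path-Level-EAS repository may differ.
import numpy as np

def split_cifar10_indices(num_train=50000, num_valid=5000, seed=0):
    """Return disjoint (train, validation) index arrays."""
    rng = np.random.RandomState(seed)
    perm = rng.permutation(num_train)
    valid_idx = perm[:num_valid]   # 5,000 images held out for architecture search
    train_idx = perm[num_valid:]   # remaining 45,000 images used for training
    return train_idx, valid_idx

train_idx, valid_idx = split_cifar10_indices()
assert len(train_idx) == 45000 and len(valid_idx) == 5000
```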
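
The Experiment Setup row quotes a REINFORCE meta-controller whose variance is reduced with an exponential-moving-average reward baseline (decay 0.95). Below is a small sketch of such a baseline; the class name and the way the advantage is returned are assumptions for illustration, not the authors' implementation.

```python
# Sketch of an exponential-moving-average reward baseline with decay 0.95,
# as described for the REINFORCE update of the meta-controller.
# Illustrative only; not taken from the Path-Level-EAS repository.
class EMABaseline:
    def __init__(self, decay=0.95):
        self.decay = decay
        self.value = None  # baseline is undefined until the first reward arrives

    def advantage(self, reward):
        """Update the moving average and return (reward - baseline)."""
        if self.value is None:
            self.value = reward
        else:
            self.value = self.decay * self.value + (1.0 - self.decay) * reward
        return reward - self.value
```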
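
The same row lists the child-network training hyperparameters (SGD with Nesterov momentum 0.9, initial learning rate 0.035 annealed with cosine decay over 20 epochs, batch size 64, weight decay 0.0001, and 300 epochs with an initial learning rate of 0.1 for the final runs). A PyTorch sketch of that optimizer and schedule follows; the helper name and structure are assumptions, though the hyperparameter values come from the quoted setup.

```python
# Hedged PyTorch sketch of the child-network optimizer and cosine schedule
# quoted in the Experiment Setup row. Hyperparameter values follow the paper;
# the function itself is illustrative and not the authors' code.
import torch
from torch.optim.lr_scheduler import CosineAnnealingLR

def make_child_optimizer(model, epochs=20, init_lr=0.035):
    optimizer = torch.optim.SGD(
        model.parameters(),
        lr=init_lr,          # 0.035 during search; 0.1 for the final 300-epoch runs
        momentum=0.9,
        nesterov=True,
        weight_decay=1e-4,
    )
    # Cosine learning-rate decay (Loshchilov & Hutter, 2016) over the training epochs.
    scheduler = CosineAnnealingLR(optimizer, T_max=epochs)
    return optimizer, scheduler
```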