Network Pruning via Transformable Architecture Search

Authors: Xuanyi Dong, Yi Yang

NeurIPS 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments on CIFAR-10, CIFAR-100 and ImageNet demonstrate the effectiveness of our new perspective of network pruning compared to traditional network pruning algorithms. Various searching and knowledge transfer approaches are conducted to show the effectiveness of the two components.
Researcher Affiliation | Collaboration | Xuanyi Dong, Yi Yang; ReLER, CAI, University of Technology Sydney and Baidu Research; xuanyi.dong@student.uts.edu.au; yi.yang@uts.edu.au
Pseudocode | Yes | Algorithm 1: The TAS Procedure
Open Source Code | Yes | Code is at: https://github.com/D-X-Y/NAS-Projects.
Open Datasets | Yes | We evaluate our approach on CIFAR-10, CIFAR-100 [27] and ImageNet [6].
Dataset Splits | Yes | Input: split the training set into two disjoint sets: Dtrain and Dval (a search-loop sketch over this split follows the table).
Hardware Specification | Yes | As shown in Table 2, TAS can finish the searching procedure of ResNet-32 in about 3.8 hours on a single V100 GPU. For ResNet-18, it takes about 59 hours to search for the pruned network on 4 NVIDIA Tesla V100 GPUs.
Software Dependencies | No | The paper mentions using SGD, Adam, and a cosine scheduler, but does not provide specific version numbers for software libraries (e.g., PyTorch, TensorFlow) or their dependencies.
Experiment Setup | Yes | For the weights, we start the learning rate from 0.1 and reduce it with the cosine scheduler [34]. For the architecture parameters, we use a constant learning rate of 0.001 and a weight decay of 0.001. On both CIFAR-10 and CIFAR-100, we train the model for 600 epochs with a batch size of 256. On ImageNet, we train ResNets [17] for 120 epochs with a batch size of 256. The toleration ratio t is always set as 5%. The τ in Eq. (3) is linearly decayed from 10 to 0.1. For CIFAR experiments, we use SGD with a momentum of 0.9 and a weight decay of 0.0005. We train each model for 300 epochs, start the learning rate at 0.1, and reduce it with the cosine scheduler [34]. We use a batch size of 256 and 2 GPUs. When using KD on CIFAR, we use a λ of 0.9 and a temperature T of 4 following [46]. For ResNet models on ImageNet, we follow most hyper-parameters from CIFAR, but use a weight decay of 0.0001. We use 4 GPUs to train the model for 120 epochs with a batch size of 256. When using KD on ImageNet, we set λ as 0.5 and T as 4. (Sketches of the τ schedule and the KD fine-tuning recipe follow the table.)
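The Pseudocode and Dataset Splits rows point to Algorithm 1 (the TAS procedure) and the disjoint Dtrain/Dval split. The outline below is a minimal sketch of the bi-level search pattern those rows describe, assuming PyTorch (no version is stated in the paper) and a hypothetical `supernet` object exposing separate weight and architecture parameters: weights are updated on Dtrain, architecture parameters are updated on Dval with Adam at a constant learning rate of 0.001 and weight decay 0.001, and the Gumbel-softmax temperature τ is decayed linearly from 10 to 0.1. The computation-cost term in the paper's search objective and the final knowledge-distillation stage are omitted here; this is not a transcription of Algorithm 1.

```python
# Minimal sketch of an alternating (bi-level) search loop over the Dtrain/Dval split.
# Assumes PyTorch; `supernet`, its parameter groups, and `set_tau` are hypothetical
# placeholders, not the authors' API.
import torch
import torch.nn.functional as F

def search(supernet, train_loader, val_loader, epochs, device="cuda"):
    # Network weights: SGD with momentum 0.9, LR 0.1 annealed by a cosine scheduler.
    w_opt = torch.optim.SGD(supernet.weight_parameters(), lr=0.1,
                            momentum=0.9, weight_decay=5e-4)
    w_sched = torch.optim.lr_scheduler.CosineAnnealingLR(w_opt, T_max=epochs)
    # Architecture parameters: constant LR 0.001 and weight decay 0.001, as quoted above.
    a_opt = torch.optim.Adam(supernet.arch_parameters(), lr=1e-3, weight_decay=1e-3)

    for epoch in range(epochs):
        # Linearly decay the Gumbel-softmax temperature tau from 10 to 0.1.
        tau = 10.0 - (10.0 - 0.1) * epoch / max(epochs - 1, 1)
        supernet.set_tau(tau)
        for (x_tr, y_tr), (x_val, y_val) in zip(train_loader, val_loader):
            # 1) Update network weights on a Dtrain batch.
            loss_w = F.cross_entropy(supernet(x_tr.to(device)), y_tr.to(device))
            w_opt.zero_grad(); loss_w.backward(); w_opt.step()
            # 2) Update architecture parameters on a Dval batch
            #    (the paper also penalizes computation cost; that term is omitted here).
            loss_a = F.cross_entropy(supernet(x_val.to(device)), y_val.to(device))
            a_opt.zero_grad(); loss_a.backward(); a_opt.step()
        w_sched.step()
```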
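The Experiment Setup row mentions the temperature τ of Eq. (3) being decayed linearly from 10 to 0.1. Eq. (3) itself is not reproduced in this table, so the snippet below only illustrates the standard Gumbel-softmax relaxation that such a temperature typically controls, applied to logits over candidate channel counts; treat it as an assumption about the construction rather than the paper's exact formula.

```python
# Generic Gumbel-softmax relaxation over candidate widths; an assumption about what the
# temperature tau in Eq. (3) controls, not a transcription of the paper's equation.
import torch

def gumbel_softmax_weights(arch_logits: torch.Tensor, tau: float) -> torch.Tensor:
    """Soft, differentiable selection weights over candidates; low tau approaches one-hot."""
    u = torch.rand_like(arch_logits).clamp_(1e-10, 1.0)
    gumbel = -torch.log(-torch.log(u))  # Gumbel(0, 1) noise
    return torch.softmax((arch_logits + gumbel) / tau, dim=-1)

# Example: logits over three candidate channel counts, evaluated at the start,
# middle, and end of the linear tau schedule (10 -> 0.1).
logits = torch.zeros(3, requires_grad=True)
for tau in (10.0, 5.0, 0.1):
    weights = gumbel_softmax_weights(logits, tau)
```

PyTorch also ships torch.nn.functional.gumbel_softmax, which implements the same relaxation and could replace the hand-rolled helper above.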
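Finally, the fine-tuning hyper-parameters quoted in the Experiment Setup row (SGD with momentum 0.9 and weight decay 0.0005, learning rate 0.1 with a cosine schedule, 300 epochs on CIFAR, batch size 256, KD with λ = 0.9 and T = 4) translate into a short training recipe. The sketch below assumes PyTorch and a Hinton-style KD loss; `student`, `teacher`, and `train_loader` are placeholders, and the exact blending of the soft and hard terms is our assumption rather than a quote from the paper.

```python
# Hedged sketch of the CIFAR fine-tuning recipe quoted above; assumes PyTorch and a
# Hinton-style KD loss. Model, teacher, and loader objects are placeholders.
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, targets, lam=0.9, temperature=4.0):
    # Soft term: KL between temperature-softened teacher and student distributions.
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=1),
        F.softmax(teacher_logits / temperature, dim=1),
        reduction="batchmean",
    ) * (temperature ** 2)
    # Hard term: ordinary cross-entropy against the ground-truth labels.
    hard = F.cross_entropy(student_logits, targets)
    return lam * soft + (1.0 - lam) * hard

def finetune(student, teacher, train_loader, epochs=300, device="cuda"):
    optimizer = torch.optim.SGD(student.parameters(), lr=0.1,
                                momentum=0.9, weight_decay=5e-4)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)
    teacher.eval()
    for _ in range(epochs):
        for images, targets in train_loader:
            images, targets = images.to(device), targets.to(device)
            with torch.no_grad():
                teacher_logits = teacher(images)
            loss = kd_loss(student(images), teacher_logits, targets)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        scheduler.step()
```

For the ImageNet runs, the quoted changes are a weight decay of 0.0001, 120 epochs, λ = 0.5, and the same T = 4.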