Network Pruning via Transformable Architecture Search
Authors: Xuanyi Dong, Yi Yang
NeurIPS 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on CIFAR-10, CIFAR-100 and ImageNet demonstrate the effectiveness of our new perspective of network pruning compared to traditional network pruning algorithms. Various searching and knowledge transfer approaches are conducted to show the effectiveness of the two components. |
| Researcher Affiliation | Collaboration | Xuanyi Dong, Yi Yang. ReLER, CAI, University of Technology Sydney; Baidu Research. xuanyi.dong@student.uts.edu.au; yi.yang@uts.edu.au |
| Pseudocode | Yes | Algorithm 1 The TAS Procedure |
| Open Source Code | Yes | Code is at: https://github.com/D-X-Y/NAS-Projects. |
| Open Datasets | Yes | We evaluate our approach on CIFAR-10, CIFAR-100 [27] and ImageNet [6]. |
| Dataset Splits | Yes | Input: split the training set into two disjoint sets: Dtrain and Dval |
| Hardware Specification | Yes | The speedup gain. As shown in Table 2, TAS can finish the searching procedure of ResNet-32 in about 3.8 hours on a single V100 GPU. For ResNet-18, it takes about 59 hours to search for the pruned network on 4 NVIDIA Tesla V100 GPUs. |
| Software Dependencies | No | The paper mentions using SGD, Adam, and a cosine scheduler, but does not provide version numbers for software libraries (e.g., PyTorch, TensorFlow) or their dependencies. |
| Experiment Setup | Yes | For the weights, we start the learning rate from 0.1 and reduce it by the cosine scheduler [34]. For the architecture parameters, we use the constant learning rate of 0.001 and a weight decay of 0.001. On both CIFAR-10 and CIFAR-100, we train the model for 600 epochs with the batch size of 256. On ImageNet, we train ResNets [17] for 120 epochs with the batch size of 256. The toleration ratio t is always set as 5%. The τ in Eq. (3) is linearly decayed from 10 to 0.1. For CIFAR experiments, we use SGD with a momentum of 0.9 and a weight decay of 0.0005. We train each model for 300 epochs, start the learning rate at 0.1, and reduce it by the cosine scheduler [34]. We use the batch size of 256 and 2 GPUs. When using KD on CIFAR, we use λ of 0.9 and the temperature T of 4 following [46]. For ResNet models on ImageNet, we follow most hyper-parameters as CIFAR, but use a weight decay of 0.0001. We use 4 GPUs to train the model by 120 epochs with the batch size of 256. When using KD on ImageNet, we set λ as 0.5 and T as 4 on ImageNet. |
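
The Dataset Splits row above states that the training set is split into two disjoint sets, Dtrain and Dval, for the search procedure. Below is a minimal sketch of such a split, assuming torchvision's CIFAR-10 loader and an illustrative 50/50 random split; the paper's quoted cell does not state the actual ratio.

```python
# Hypothetical sketch of the Dtrain / Dval split mentioned in the
# "Dataset Splits" row. The 50/50 ratio and the fixed seed are
# illustrative assumptions, not values reported by the paper.
import torch
from torch.utils.data import random_split
from torchvision import datasets, transforms

full_train = datasets.CIFAR10(root="./data", train=True, download=True,
                              transform=transforms.ToTensor())
half = len(full_train) // 2
d_train, d_val = random_split(
    full_train, [half, len(full_train) - half],
    generator=torch.Generator().manual_seed(0),
)
```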
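
The Experiment Setup row reports that the architecture parameters are trained with a constant learning rate of 0.001 and weight decay of 0.001, and that the temperature τ in Eq. (3) is linearly decayed from 10 to 0.1. A minimal sketch of that schedule with Gumbel-softmax sampling follows; the use of Adam, the 600-epoch horizon, and names such as `arch_params` and `sample_width` are assumptions for illustration, not details confirmed by the quoted cells.

```python
# Sketch of the search-phase temperature schedule and soft width sampling.
# Adam for architecture parameters and the 600-epoch horizon are assumptions.
import torch
import torch.nn.functional as F

def tau_at(epoch, search_epochs, tau_max=10.0, tau_min=0.1):
    """Linearly decay the Gumbel-softmax temperature from tau_max to tau_min."""
    frac = epoch / max(search_epochs - 1, 1)
    return tau_max + (tau_min - tau_max) * frac

def sample_width(arch_params, tau):
    """Draw a soft (differentiable) selection over candidate widths."""
    return F.gumbel_softmax(arch_params, tau=tau, hard=False, dim=-1)

# Illustrative usage: one architecture-parameter vector over 4 width choices.
arch_params = torch.zeros(4, requires_grad=True)
arch_opt = torch.optim.Adam([arch_params], lr=0.001, weight_decay=0.001)
probs = sample_width(arch_params, tau_at(epoch=0, search_epochs=600))
```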
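
The same row gives the CIFAR training recipe for the pruned network: SGD with momentum 0.9 and weight decay 0.0005, an initial learning rate of 0.1 reduced by a cosine scheduler over 300 epochs, batch size 256, and knowledge distillation with λ = 0.9 and temperature T = 4. The sketch below assumes the standard Hinton-style KD loss and a (1 − λ)/λ blend of the hard and soft terms; the paper's exact loss formulation may differ.

```python
# Hedged PyTorch sketch of the quoted CIFAR training recipe. The KD blend
# convention ((1 - lam) * CE + lam * soft) is an assumption.
import torch
import torch.nn.functional as F

EPOCHS, BATCH_SIZE, LAMBDA, TEMPERATURE = 300, 256, 0.9, 4.0

def kd_loss(student_logits, teacher_logits, targets,
            lam=LAMBDA, T=TEMPERATURE):
    """Combine hard-label cross-entropy with a softened teacher match."""
    ce = F.cross_entropy(student_logits, targets)
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)
    return (1.0 - lam) * ce + lam * soft

def make_optimizer(model):
    """SGD with momentum 0.9, weight decay 5e-4, cosine LR from 0.1."""
    opt = torch.optim.SGD(model.parameters(), lr=0.1,
                          momentum=0.9, weight_decay=5e-4)
    sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=EPOCHS)
    return opt, sched
```

For ImageNet, the row swaps in a weight decay of 0.0001, 120 epochs, and λ = 0.5 with the same temperature.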