Zero Stability Well Predicts Performance of Convolutional Neural Networks
Authors: Liangming Chen, Long Jin, Mingsheng Shang (pp. 6268-6277)
AAAI 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, we present our results from three aspects: Our experiments show that ZeroSNet outperforms existing CNNs which are based on high-order discretization; ZeroSNets show better robustness against noise on the input. Four groups of experiments are carried out in this paper. |
| Researcher Affiliation | Academia | 1 Chongqing Key Laboratory of Big Data and Intelligent Computing, Chongqing Institute of Green and Intelligent Technology, Chinese Academy of Sciences 2 Chongqing School, University of Chinese Academy of Sciences 3 School of Information Science and Engineering, Lanzhou University |
| Pseudocode | No | No pseudocode or clearly labeled algorithm block was found in the paper. The methodology is described textually and with diagrams (Figure 3). |
| Open Source Code | Yes | The source code is available at https://github.com/logichen/ZeroSNet. |
| Open Datasets | Yes | Empirically, we present our results from three aspects: We provide extensive empirical evidence across different depths on different datasets to show that the moduli of the characteristic equation's roots are the keys to the performance of CNNs that require historical features; Our experiments show that ZeroSNet outperforms existing CNNs which are based on high-order discretization; ZeroSNets show better robustness against noise on the input. Note that hyperparameters for CIFAR-10 and CIFAR-100 are the same as those in (Lu et al. 2018). We compare ZeroSNet with existing high-order-discretization CNNs (LM-ResNets) and PreResNets (Lu et al. 2018; He et al. 2016) on the CIFAR-10 and CIFAR-100 datasets. In addition, comparisons on ImageNet are also performed. |
| Dataset Splits | Yes | On CIFAR, we use a batch size of 128 with an initial learning rate of 0.1, a momentum of 0.9, and a weight decay of 0.0001. Models in generalization gap experiments (Table 7) are trained for 500 epochs to achieve sufficient training. Except for the generalization gap experiments, all models on CIFAR-10 and CIFAR-100 are trained for 160 and 300 epochs, respectively. We apply the step decay to train all models on CIFAR and divide the learning rate by 10 at half and three-quarters of the total epochs. (A code sketch of this CIFAR training schedule follows the table.) |
| Hardware Specification | Yes | We use the PyTorch 1.8.1 framework and run our experiments on a server with 10 RTX 2080 Ti GPUs and 2 RTX 3090 GPUs. |
| Software Dependencies | Yes | We use the PyTorch 1.8.1 framework and run our experiments on a server with 10 RTX 2080 Ti GPUs and 2 RTX 3090 GPUs. |
| Experiment Setup | Yes | On CIFAR, we use a batch size of 128 with an initial learning rate of 0.1, a momentum of 0.9, and a weight decay of 0.0001. Models in generalization gap experiments (Table 7) are trained for 500 epochs to achieve sufficient training. Except for the generalization gap experiments, all models on CIFAR-10 and CIFAR-100 are trained for 160 and 300 epochs, respectively. We apply the step decay to train all models on CIFAR and divide the learning rate by 10 at half and three-quarters of the total epochs. We report the mean ± standard deviation of accuracies based on three individual runs. For the trainable version of ZeroSNet (i.e., ZeroSNet Tra), all λn are initialized as 1. The data augmentations are the random crop with 4-pixel padding and random horizontal flip, as in (Lu et al. 2018). Our training script is based on https://github.com/13952522076/EfficientImageNetClassification and retains all default hyperparameters. To improve the training efficiency on ImageNet, we use a mixed-precision strategy provided by NVIDIA Apex with distributed training. We apply the cosine decay with a 5-epoch warmup to train models for 150 epochs. The weight decay and the momentum are 4×10⁻⁵ and 0.9, respectively. Following the adjustment guidance of the learning rate and the batch size (Goyal et al. 2017; Jastrzebski et al. 2018), we set them according to the GPU memory. Specifically, for 18-layer models, we use an initial learning rate of 0.2 and a batch size of 128; for 34-layer models, an initial learning rate of 0.1 and a batch size of 64; for 50-layer models, an initial learning rate of 0.05 and a batch size of 32. For ImageNet, we apply 8-GPU distributed training on a single server. (Code sketches of the CIFAR step-decay setup, the ImageNet cosine-warmup schedule, and the mixed-precision loop follow this table.) |
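
To make the quoted CIFAR setup concrete (rows "Dataset Splits" and "Experiment Setup"), here is a minimal training sketch assuming a standard PyTorch pipeline. The ResNet-18 backbone and the data directory are placeholders for illustration, not the authors' ZeroSNet code; only the hyperparameters are taken from the quotes above.

```python
import torch
from torch import nn, optim
from torch.optim.lr_scheduler import MultiStepLR
from torch.utils.data import DataLoader
from torchvision import datasets, models, transforms

device = "cuda" if torch.cuda.is_available() else "cpu"

# Settings quoted in the table: 160 epochs for CIFAR-10 (300 for CIFAR-100), batch size 128.
total_epochs = 160
batch_size = 128

# Data augmentation as quoted: random crop with 4-pixel padding and random horizontal flip.
train_tf = transforms.Compose([
    transforms.RandomCrop(32, padding=4),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])
train_set = datasets.CIFAR10(root="./data", train=True, download=True, transform=train_tf)
train_loader = DataLoader(train_set, batch_size=batch_size, shuffle=True, num_workers=4)

# Placeholder backbone; the paper's ZeroSNet itself is not reproduced here.
model = models.resnet18(num_classes=10).to(device)
criterion = nn.CrossEntropyLoss()

# SGD with initial learning rate 0.1, momentum 0.9, weight decay 1e-4.
optimizer = optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=1e-4)

# Step decay: divide the learning rate by 10 at half and three-quarters of the total epochs.
scheduler = MultiStepLR(optimizer, milestones=[total_epochs // 2, 3 * total_epochs // 4], gamma=0.1)

for epoch in range(total_epochs):
    model.train()
    for images, labels in train_loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
    scheduler.step()
```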
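
The ImageNet schedule is quoted only as "cosine decay with a 5-epoch warmup" over 150 epochs, with per-depth learning rates and batch sizes. The sketch below shows one common interpretation (linear warmup, one scheduler step per epoch); the exact shape used by the referenced training script is an assumption, not something stated in the table.

```python
import math
import torch
from torch.optim.lr_scheduler import LambdaLR

# Per-depth ImageNet settings quoted in the table.
settings = {
    18: {"lr": 0.2, "batch_size": 128},
    34: {"lr": 0.1, "batch_size": 64},
    50: {"lr": 0.05, "batch_size": 32},
}

def cosine_with_warmup(optimizer, warmup_epochs=5, total_epochs=150):
    """Linear warmup for `warmup_epochs`, then cosine decay over the remaining epochs.

    Assumes one scheduler.step() per epoch.
    """
    def lr_lambda(epoch):
        if epoch < warmup_epochs:
            return (epoch + 1) / warmup_epochs
        progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
        return 0.5 * (1.0 + math.cos(math.pi * progress))
    return LambdaLR(optimizer, lr_lambda)

# Example for the 18-layer configuration; the single parameter stands in for model.parameters().
params = [torch.nn.Parameter(torch.zeros(1))]
optimizer = torch.optim.SGD(params, lr=settings[18]["lr"], momentum=0.9, weight_decay=4e-5)
scheduler = cosine_with_warmup(optimizer)
```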
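
The table also quotes mixed-precision training via NVIDIA Apex with 8-GPU distributed training. As a hedged sketch, the loop below uses PyTorch's built-in torch.cuda.amp (available in PyTorch 1.8.1) as a swap-in for Apex; the function name and arguments are illustrative and not taken from the authors' script.

```python
import torch
from torch.cuda.amp import GradScaler, autocast

def train_one_epoch(model, loader, optimizer, scaler: GradScaler, device):
    """Run one mixed-precision epoch.

    `model` is assumed to already be wrapped in DistributedDataParallel, with one
    process per GPU (8 GPUs on a single server, as quoted in the table).
    """
    model.train()
    for images, labels in loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        with autocast():  # run the forward pass in mixed precision
            loss = torch.nn.functional.cross_entropy(model(images), labels)
        scaler.scale(loss).backward()  # scale the loss to avoid fp16 gradient underflow
        scaler.step(optimizer)         # unscale gradients and take an optimizer step
        scaler.update()                # adjust the loss scale for the next iteration
```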