Deep Architecture Connectivity Matters for Its Convergence: A Fine-Grained Analysis
Authors: Wuyang Chen, Wei Huang, Xinyu Gong, Boris Hanin, Zhangyang Wang
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We first experimentally verify our convergence analysis in Section 3.3. In all cases we use ReLU nonlinearities with Kaiming normal initialization [24]. We build the same three computational graphs of fully-connected layers in Figure 3. Three networks have hidden layers of a constant width of 1024. We train the network using SGD with a mini-batch of size 128. The learning rate is fixed at 1×10⁻⁵. No augmentation, weight decay, learning rate decay, or momentum is adopted. (A minimal training sketch for this setup follows the table.) |
| Researcher Affiliation | Academia | Wuyang Chen (University of Texas at Austin), Wei Huang (RIKEN AIP), Xinyu Gong (University of Texas at Austin), Boris Hanin (Princeton University), Zhangyang Wang (University of Texas at Austin) |
| Pseudocode | Yes | We provide a pseudocode algorithm in Appendix A to demonstrate the usage of our method. |
| Open Source Code | Yes | Code is available at: https://github.com/VITA-Group/architecture_convergence. |
| Open Datasets | Yes | On both MNIST and CIFAR-10, the convergence rate of DAG#1 (Figure 3 left) is worse than DAG#2 (Figure 3 middle), and is further worse than DAG#3 (Figure 3 right). |
| Dataset Splits | Yes | The NAS-Bench-201 [17] provides 15,625 architectures that are stacked by repeated DAGs of four nodes (exactly the same DAG we considered in Section 3 and Figure 2). It contains architectures' performance on three datasets (CIFAR-10, CIFAR-100, ImageNet-16-120 [15]) evaluated under a unified protocol (i.e. same learning rate, batch size, etc., for all architectures). |
| Hardware Specification | Yes | Recorded on a single GTX 1080Ti GPU. |
| Software Dependencies | No | The paper does not specify software dependencies with version numbers. |
| Experiment Setup | Yes | We train searched architectures for 250 epochs using SGD, with a learning rate of 0.5, a cosine scheduler, momentum of 0.9, weight decay of 3×10⁻⁵, and a batch size of 768. This setting follows previous works [1, 44, 69, 41, 66, 26, 12, 10]. (A sketch of this schedule follows the table.) |
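
The convergence-verification setup quoted in the "Research Type" row amounts to a plain SGD training loop. Below is a minimal PyTorch sketch, assuming a simple chain of fully-connected layers stands in for one of the three computational graphs of Figure 3; the exact DAG wiring, the network depth, and the MNIST loader details are illustrative assumptions, not taken from the paper.

```python
# Minimal sketch of the convergence-verification setup (width 1024, ReLU,
# Kaiming normal init, SGD with batch 128 and fixed lr 1e-5, no momentum,
# no weight decay, no augmentation). DEPTH and the single training pass
# shown here are assumptions for illustration.
import torch
import torch.nn as nn
from torchvision import datasets, transforms

WIDTH, BATCH, LR, DEPTH = 1024, 128, 1e-5, 4

def make_fc_net(in_dim=28 * 28, width=WIDTH, depth=DEPTH, num_classes=10):
    layers, dim = [], in_dim
    for _ in range(depth):
        linear = nn.Linear(dim, width)
        nn.init.kaiming_normal_(linear.weight)  # Kaiming normal initialization
        nn.init.zeros_(linear.bias)
        layers += [linear, nn.ReLU()]
        dim = width
    layers.append(nn.Linear(dim, num_classes))
    return nn.Sequential(*layers)

loader = torch.utils.data.DataLoader(
    datasets.MNIST("data", train=True, download=True,
                   transform=transforms.ToTensor()),  # no augmentation
    batch_size=BATCH, shuffle=True)

net = make_fc_net()
opt = torch.optim.SGD(net.parameters(), lr=LR)  # fixed lr, no momentum/decay
criterion = nn.CrossEntropyLoss()

for x, y in loader:  # one pass over MNIST shown
    opt.zero_grad()
    loss = criterion(net(x.flatten(1)), y)
    loss.backward()
    opt.step()
```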
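
The retraining schedule quoted in the "Experiment Setup" row (250 epochs of SGD, learning rate 0.5 with a cosine scheduler, momentum 0.9, weight decay 3×10⁻⁵, batch size 768) maps onto a standard optimizer/scheduler pair. The sketch below substitutes a small stand-in convnet and a CIFAR-10 loader for the searched architecture, which is not reconstructed here; both are assumptions.

```python
# Sketch of the searched-architecture retraining schedule. The tiny convnet
# below is only a placeholder for the searched cell-based network.
import torch
import torch.nn as nn
from torchvision import datasets, transforms

EPOCHS, BATCH = 250, 768

model = nn.Sequential(  # placeholder model, not the searched architecture
    nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, 10))

train_loader = torch.utils.data.DataLoader(
    datasets.CIFAR10("data", train=True, download=True,
                     transform=transforms.ToTensor()),
    batch_size=BATCH, shuffle=True)

optimizer = torch.optim.SGD(model.parameters(), lr=0.5,
                            momentum=0.9, weight_decay=3e-5)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=EPOCHS)
criterion = nn.CrossEntropyLoss()

for epoch in range(EPOCHS):
    for images, labels in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
    scheduler.step()  # one cosine-annealing step per epoch
```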