Towards Theoretically Inspired Neural Initialization Optimization

Authors: Yibo Yang, Hong Wang, Haobo Yuan, Zhouchen Lin

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments show that for a variety of deep architectures including ResNet [19], DenseNet [21], and Wide ResNet [56], our method achieves better classification results on CIFAR-10/100 [27] than prior heuristic [18] and learning-based [8, 60] initialization methods. We can also initialize ResNet-50 [19] on ImageNet [9] for better performance. Moreover, our method is able to help the recently proposed Swin-Transformer [32] achieve stable training and competitive results on ImageNet even without warmup [17], which is crucial for the successful training of Transformer architectures [31, 52].
Researcher Affiliation | Collaboration | Yibo Yang (1), Hong Wang (2), Haobo Yuan (3), Zhouchen Lin (2,4,5); (1) JD Explore Academy, Beijing, China; (2) Key Lab. of Machine Perception (MoE), School of Intelligence Science and Technology, Peking University; (3) Institute of Artificial Intelligence and School of Computer Science, Wuhan University; (4) Institute for Artificial Intelligence, Peking University; (5) Pazhou Laboratory, Guangzhou, China
Pseudocode | Yes | Algorithm 1: GradCosine (GC) and gradient norm (GN) ... Algorithm 2: Batch GradCosine (B-GC) and batch gradient norm (B-GN) ... Algorithm 3: Neural Initialization Optimization. (A hedged sketch of the GC/GN computation appears after this table.)
Open Source Code | Yes | Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] See supplementary material.
Open Datasets | Yes | We validate our method on three widely used datasets including CIFAR-10/100 [27] and ImageNet [9]. ... [27] A. Krizhevsky, G. Hinton, et al. Learning multiple layers of features from tiny images. 2009. ... [9] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, pages 248-255, 2009.
Dataset Splits | No | The paper mentions training models and evaluating on a test set, and refers to 'Detailed settings for different architectures and datasets are described in Appendix B.' However, it does not explicitly state the specific train/validation/test splits (e.g., percentages or counts) in the main text, nor does it explicitly mention a separate validation set.
Hardware Specification | Yes | Train time is tested on an NVIDIA A100 server with a batch size of 256 across 8 GPUs.
Software Dependencies | No | The paper does not provide specific version numbers for any software dependencies (e.g., Python version, PyTorch version, CUDA version).
Experiment Setup | Yes | After initialization, we train these models for 500 epochs with the same training setting. Each model is trained four times with different seeds. ... using ResNet-50 for 100 epochs with a batch size of 256. ... Detailed training and initialization settings are described in Appendix B. (A minimal sketch of this multi-seed protocol follows the GC/GN sketch below.)
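For orientation, the following is a minimal sketch of how the GradCosine (GC) and gradient norm (GN) quantities named in Algorithm 1 might be computed. It assumes GC is the average pairwise cosine similarity of per-sample gradients at initialization and GN is the norm of the sample-averaged gradient; the function names and this reading are ours, not the authors' released implementation.

```python
import torch
import torch.nn.functional as F

def per_sample_grads(model, loss_fn, xs, ys):
    """Flattened gradient of the loss on each individual sample.

    Assumes the model accepts batch size 1 (e.g., eval mode or no batch norm).
    """
    params = [p for p in model.parameters() if p.requires_grad]
    grads = []
    for x, y in zip(xs, ys):
        loss = loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0))
        g = torch.autograd.grad(loss, params)
        grads.append(torch.cat([gi.reshape(-1) for gi in g]))
    return torch.stack(grads)                        # shape (N, num_params)

def grad_cosine_and_norm(model, loss_fn, xs, ys):
    """GC: mean pairwise cosine similarity of per-sample gradients.
    GN: norm of the sample-averaged gradient."""
    G = per_sample_grads(model, loss_fn, xs, ys)
    G_unit = F.normalize(G, dim=1)                   # unit-norm per-sample gradients
    n = G.shape[0]
    gc = ((G_unit @ G_unit.t()).sum() - n) / (n * (n - 1))  # exclude the diagonal
    gn = G.mean(dim=0).norm()
    return gc, gn
```

Per the row above, Algorithms 2 and 3 extend these quantities to mini-batches and use them to optimize the initialization; consult the paper and its supplementary code for the actual procedure.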
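The multi-seed protocol quoted in the Experiment Setup row (each model trained four times with different seeds under the same training setting) could be driven by a harness along these lines; `build_model`, `initialize`, `train`, and `evaluate` are hypothetical placeholders, and all real hyperparameters are described in the paper's Appendix B.

```python
import torch

def run_protocol(build_model, initialize, train, evaluate,
                 seeds=(0, 1, 2, 3), epochs=500):
    """Train one architecture several times with different seeds and
    report mean/std accuracy, mirroring the quoted setup."""
    accuracies = []
    for seed in seeds:                    # four runs with different seeds
        torch.manual_seed(seed)
        model = build_model()
        initialize(model)                 # e.g., the paper's initialization method
        train(model, epochs=epochs)       # identical training setting for every run
        accuracies.append(evaluate(model))
    accs = torch.tensor(accuracies)
    return accs.mean().item(), accs.std().item()
```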