G-SGD: Optimizing ReLU Neural Networks in its Positively Scale-Invariant Space
Authors: Qi Meng, Shuxin Zheng, Huishuai Zhang, Wei Chen, Qiwei Ye, Zhi-Ming Ma, Nenghai Yu, Tie-Yan Liu
ICLR 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments show that G-SGD significantly outperforms the conventional SGD algorithm in optimizing ReLU networks on benchmark datasets. |
| Researcher Affiliation | Collaboration | 1 Microsoft Research Asia, 2,3 University of Science and Technology of China, 4 University of Chinese Academy of Sciences |
| Pseudocode | Yes | Algorithm 1 G-SGD |
| Open Source Code | No | The ResNet implementation can be found in https://github.com/pytorch/vision/ and the models are initialized by the default methods in PyTorch. |
| Open Datasets | Yes | We apply our G-SGD to image classification tasks and conduct experiments on CIFAR-10 and CIFAR-100 (Krizhevsky & Hinton, 2009). ... train several 2-hidden-layer MLP models on Fashion-MNIST (Xiao et al., 2017) ... conduct our studies on MNIST and CIFAR-10 datasets. |
| Dataset Splits | Yes | We do a wide-range grid search for the hyper-parameter λ for weight decay and basis path regularization from {0.1, 0.2, 0.5} × 10^{-α}, where α ∈ {3, 4, 5, 6}, and report the best performance based on the CIFAR-10 validation set. |
| Hardware Specification | Yes | We implement our G-SGD using the PyTorch framework with the v0.3.1 stable release, and conduct our experiments comparing with the PyTorch built-in SGD optimizer. Our experiments are conducted on a GPU server with 4 NVIDIA GTX Titan Xp GPUs and a PCI switch. |
| Software Dependencies | Yes | We implement our G-SGD using the PyTorch framework with the v0.3.1 stable release |
| Experiment Setup | Yes | We apply a random crop of size 32 with padding 4 to the input image, and normalize each pixel value to [0, 1]. We then apply random horizontal flipping to the image. A mini-batch size of 128 is used in this experiment. The training is conducted for 64k iterations. We follow the learning rate schedule strategy in the original paper (He et al., 2016a); specifically, the initial learning rates of vanilla SGD and G-SGD are set to 1.0 and then divided by 10 after 32k and 48k iterations. (A hedged PyTorch sketch of this setup is given below the table.) |
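For readers attempting to reproduce the reported training pipeline, the following is a minimal PyTorch sketch of the quoted data augmentation, batch size of 128, 64k-iteration budget, and step-wise learning-rate schedule. It is not the authors' code: G-SGD itself is not open-sourced, so a plain torchvision ResNet-18 with vanilla SGD stands in, and the use of `MultiStepLR` and the model choice are assumptions.

```python
# Hedged reproduction sketch (not the authors' code): a vanilla-SGD baseline
# with the reported CIFAR-10 augmentation, batch size, and LR schedule.
import torch
import torch.nn as nn
import torchvision
import torchvision.transforms as transforms

# Reported augmentation: random 32x32 crop with padding 4 and horizontal flip;
# ToTensor() scales pixel values to [0, 1] as described in the setup.
train_transform = transforms.Compose([
    transforms.RandomCrop(32, padding=4),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])

train_set = torchvision.datasets.CIFAR10(
    root="./data", train=True, download=True, transform=train_transform)
train_loader = torch.utils.data.DataLoader(
    train_set, batch_size=128, shuffle=True, num_workers=2)

# Stand-in model (assumption): torchvision ResNet-18 adapted to 10 classes;
# the paper trains its own ResNet variants with the G-SGD optimizer.
model = torchvision.models.resnet18(num_classes=10)
criterion = nn.CrossEntropyLoss()

# Vanilla SGD with the reported initial learning rate of 1.0, divided by 10
# after 32k and 48k iterations; the scheduler is stepped once per iteration.
optimizer = torch.optim.SGD(model.parameters(), lr=1.0)
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[32_000, 48_000], gamma=0.1)

max_iters = 64_000
it = 0
while it < max_iters:
    for images, labels in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
        scheduler.step()  # per-iteration stepping to match the 64k-iteration budget
        it += 1
        if it >= max_iters:
            break
```

The scheduler is deliberately stepped per iteration rather than per epoch so that the 32k/48k milestones line up with the iteration counts quoted in the setup; momentum and weight decay are omitted because the excerpt does not specify them.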