G-SGD: Optimizing ReLU Neural Networks in its Positively Scale-Invariant Space
Authors: Qi Meng, Shuxin Zheng, Huishuai Zhang, Wei Chen, Qiwei Ye, Zhi-Ming Ma, Nenghai Yu, Tie-Yan Liu
ICLR 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments show that G-SGD significantly outperforms the conventional SGD algorithm in optimizing ReLU networks on benchmark datasets. |
| Researcher Affiliation | Collaboration | 1 Microsoft Research Asia, 2,3 University of Science and Technology of China, 4 University of Chinese Academy of Sciences |
| Pseudocode | Yes | Algorithm 1 G-SGD |
| Open Source Code | No | The ResNet implementation can be found in https://github.com/pytorch/vision/ and the models are initialized by the default methods in PyTorch. |
| Open Datasets | Yes | We apply our G-SGD to image classification tasks and conduct experiments on CIFAR-10 and CIFAR-100 (Krizhevsky & Hinton, 2009). ... train several 2-hidden-layer MLP models on Fashion-MNIST (Xiao et al., 2017) ... conduct our studies on MNIST and CIFAR-10 datasets. |
| Dataset Splits | Yes | We do a wide-range grid search for the hyper-parameter λ for weight decay and basis path regularization from {0.1, 0.2, 0.5} × 10^{-α}, where α ∈ {3, 4, 5, 6}, and report the best performance based on the CIFAR-10 validation set. |
| Hardware Specification | Yes | We implement our G-SGD using the PyTorch framework with the v0.3.1 stable release, and conduct our experiments comparing with the PyTorch built-in SGD optimizer. Our experiments are conducted on a GPU server with 4 NVIDIA GTX Titan Xp GPUs and a PCI switch. |
| Software Dependencies | Yes | We implement our G-SGD using the PyTorch framework with the v0.3.1 stable release |
| Experiment Setup | Yes | We apply a random crop of size 32 with padding 4 to the input image, and normalize each pixel value to [0, 1]. We then apply random horizontal flipping to the image. A mini-batch size of 128 is used in this experiment. The training is conducted for 64k iterations. We follow the learning rate schedule strategy in the original paper (He et al., 2016a); specifically, the initial learning rates of vanilla SGD and G-SGD are set to 1.0 and then divided by 10 after 32k and 48k iterations. (A hedged PyTorch sketch of this setup is given below the table.) |
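For readers attempting to reproduce the reported training pipeline, the following is a minimal PyTorch sketch of the quoted data augmentation, batch size of 128, 64k-iteration budget, and step-wise learning-rate schedule. It is not the authors' code: G-SGD itself is not open-sourced, so a plain torchvision ResNet-18 with vanilla SGD stands in, and the use of `MultiStepLR` and the model choice are assumptions.

```python
# Hedged reproduction sketch (not the authors' code): a vanilla-SGD baseline
# with the reported CIFAR-10 augmentation, batch size, and LR schedule.
import torch
import torch.nn as nn
import torchvision
import torchvision.transforms as transforms

# Reported augmentation: random 32x32 crop with padding 4 and horizontal flip;
# ToTensor() scales pixel values to [0, 1] as described in the setup.
train_transform = transforms.Compose([
    transforms.RandomCrop(32, padding=4),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])

train_set = torchvision.datasets.CIFAR10(
    root="./data", train=True, download=True, transform=train_transform)
train_loader = torch.utils.data.DataLoader(
    train_set, batch_size=128, shuffle=True, num_workers=2)

# Stand-in model (assumption): torchvision ResNet-18 adapted to 10 classes;
# the paper trains its own ResNet variants with the G-SGD optimizer.
model = torchvision.models.resnet18(num_classes=10)
criterion = nn.CrossEntropyLoss()

# Vanilla SGD with the reported initial learning rate of 1.0, divided by 10
# after 32k and 48k iterations; the scheduler is stepped once per iteration.
optimizer = torch.optim.SGD(model.parameters(), lr=1.0)
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[32_000, 48_000], gamma=0.1)

max_iters = 64_000
it = 0
while it < max_iters:
    for images, labels in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
        scheduler.step()  # per-iteration stepping to match the 64k-iteration budget
        it += 1
        if it >= max_iters:
            break
```

The scheduler is deliberately stepped per iteration rather than per epoch so that the 32k/48k milestones line up with the iteration counts quoted in the setup; momentum and weight decay are omitted because the excerpt does not specify them.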