Capacity Control of ReLU Neural Networks by Basis-Path Norm

Authors: Shuxin Zheng, Qi Meng, Huishuai Zhang, Wei Chen, Nenghai Yu, Tie-Yan Liu (pp. 5925-5932)

AAAI 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments on benchmark datasets demonstrate that the proposed regularization method achieves clearly better performance on the test set than the previous regularization approaches. In this section, we study the relationship between this bound and the empirical generalization gap (the absolute difference between test error and training error) with real-data experiments. (A minimal sketch of computing this gap appears after the table.)
Researcher Affiliation | Collaboration | (1) University of Science and Technology of China, (2) Microsoft Research Asia
Pseudocode | Yes | Algorithm 1: Optimize ReLU Network with SGD and Basis-path Regularization
Open Source Code | No | The paper does not provide an explicit statement or link indicating that code for the described methodology has been open-sourced.
Open Datasets | Yes | We conduct experiments with multi-layer perceptrons (MLP) with ReLU of different depths, widths, and global minima on the MNIST classification task... We first apply our basis-path regularization method to a recommendation task with MLP networks and conduct experimental studies based on a public dataset, MovieLens. In this section, we apply our basis-path regularization to this task and conduct experimental studies based on CIFAR10 (Krizhevsky and Hinton 2009).
Dataset Splits | Yes | The training set consists of 10,000 randomly selected samples with true labels plus up to 5,000 intentionally mislabeled samples that are gradually added to the training set. Error rates are evaluated on a fixed 10,000-sample validation set. (A hedged sketch of constructing such a corrupted-label split appears after the table.)
Hardware Specification | No | The paper does not provide specific hardware specifications (e.g., GPU/CPU models, memory) used for running the experiments.
Software Dependencies | No | The paper mentions various software components and optimizers (e.g., the Adam optimizer, SGD, ResNet, PlainNet) but does not provide specific version numbers for any of these dependencies.
Experiment Setup | Yes | More details of the training strategies can be found in the appendices. We test predictive factors of [8, 16, 32, 64], and set the number of hidden units in each hidden layer to the embedding size × 4. For each method, we perform a wide-range grid search of the hyper-parameter λ over 10^-α with α ∈ {5, 6, 7, 8, 9} and report the experimental results based on the best performance on the validation set. We train 34-layer ResNet and PlainNet networks on this dataset, and use SGD with the widely used L2 weight-decay regularization (WD) as our baseline. More training details can be found in the supplementary materials. (A hedged sketch of the λ grid search appears after the table.)
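
The "empirical generalization gap" quoted under Research Type is the absolute difference between test error and training error. A minimal sketch of that computation in plain NumPy; the array names and the error-rate helper are illustrative, not taken from the paper:

```python
import numpy as np

def error_rate(predictions: np.ndarray, labels: np.ndarray) -> float:
    """Fraction of examples whose predicted class differs from the true label."""
    return float(np.mean(predictions != labels))

def generalization_gap(train_pred, train_labels, test_pred, test_labels) -> float:
    """Empirical generalization gap: |test error - training error|."""
    return abs(error_rate(test_pred, test_labels) - error_rate(train_pred, train_labels))

# Toy usage with random labels (illustrative only).
rng = np.random.default_rng(0)
train_labels = rng.integers(0, 10, size=1000)
test_labels = rng.integers(0, 10, size=1000)
train_pred = train_labels.copy()            # pretend the model fits the training set perfectly
test_pred = rng.integers(0, 10, size=1000)  # and guesses at random on the test set
print(generalization_gap(train_pred, train_labels, test_pred, test_labels))
```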
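The corrupted-label setup quoted under Dataset Splits mixes correctly labeled MNIST samples with intentionally mislabeled ones. Below is a hedged sketch of how such a split could be built with torchvision; the sampling and corruption logic is an assumption for illustration, not the authors' released code:

```python
import numpy as np
from torchvision import datasets

def corrupted_label_split(num_clean=10000, num_mislabeled=5000, num_classes=10, seed=0):
    """Return (images, labels) with `num_clean` true-label samples followed by
    `num_mislabeled` samples whose labels are randomly reassigned to a wrong class."""
    rng = np.random.default_rng(seed)
    mnist = datasets.MNIST(root="./data", train=True, download=True)
    images = mnist.data.numpy()
    labels = mnist.targets.numpy()

    idx = rng.permutation(len(images))[: num_clean + num_mislabeled]
    split_images = images[idx]
    split_labels = labels[idx].copy()

    # Corrupt the labels of the last `num_mislabeled` samples by drawing a
    # class uniformly at random from the classes other than the true one.
    for i in range(num_clean, num_clean + num_mislabeled):
        wrong = rng.integers(0, num_classes - 1)
        if wrong >= split_labels[i]:
            wrong += 1  # skip the true class so the new label is always wrong
        split_labels[i] = wrong
    return split_images, split_labels
```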
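The hyper-parameter search quoted under Experiment Setup sweeps λ over 10^-α for α in {5, ..., 9} and keeps the configuration with the best validation performance. A minimal sketch of that outer loop, where `train_and_evaluate` is a hypothetical stand-in for a full training run rather than a function from the paper:

```python
def grid_search_lambda(train_and_evaluate, alphas=(5, 6, 7, 8, 9)):
    """Try lambda = 10**-alpha for each alpha and return the setting with the
    lowest validation error, as described in the experiment setup."""
    best_lambda, best_val_error = None, float("inf")
    for alpha in alphas:
        lam = 10.0 ** (-alpha)
        val_error = train_and_evaluate(lam)  # trains a model with this regularization weight
        if val_error < best_val_error:
            best_lambda, best_val_error = lam, val_error
    return best_lambda, best_val_error

# Illustrative usage with a dummy objective in place of real training.
if __name__ == "__main__":
    dummy = lambda lam: abs(lam - 1e-7)  # pretend 1e-7 is the optimal weight
    print(grid_search_lambda(dummy))
```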