Multi-Bias Non-linear Activation in Deep Neural Networks

Authors: Hongyang Li, Wanli Ouyang, Xiaogang Wang

ICML 2016 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate the proposed MBA module and compare with other state-of-the-art methods on several benchmarks. The CIFAR-10 dataset (Krizhevsky et al., 2012) consists of 32×32 color images in 10 classes, with 50,000 training images and 10,000 testing images.
Researcher Affiliation | Academia | Hongyang Li (YANGLI@EE.CUHK.EDU.HK), Wanli Ouyang (WLOUYANG@EE.CUHK.EDU.HK), Xiaogang Wang (XGWANG@EE.CUHK.EDU.HK), The Chinese University of Hong Kong
Pseudocode | No | The paper describes the model and processes using mathematical formulations and textual descriptions, but it does not contain structured pseudocode or algorithm blocks.
Open Source Code | Yes | Finally, the implementation code is available at https://github.com/hli2020/caffe/tree/bias.
Open Datasets | Yes | The CIFAR-10 dataset (Krizhevsky et al., 2012) consists of 32×32 color images in 10 classes, with 50,000 training images and 10,000 testing images. The CIFAR-100 dataset has the same size and format as CIFAR-10 but contains 100 classes, with only one tenth as many labeled examples per class. The SVHN (Netzer et al., 2011) dataset resembles MNIST and consists of color images of house numbers captured by Google Street View.
Dataset Splits | Yes | We follow a validation split from the training set similar to (Goodfellow et al., 2013): one tenth of the samples per class from the training set on CIFAR, and 400 plus 200 samples per class from the training and extra sets on SVHN, are selected to build the validation set.
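
As a concrete reading of the quoted split, the sketch below shows how such a per-class validation set could be carved out. This is not code from the paper; the helper name, seeding, and array variables are hypothetical and only the per-class counts come from the quote.

```python
import numpy as np

def per_class_validation_split(labels, n_val_per_class, seed=0):
    """Hold out n_val_per_class indices for every class as a validation set;
    the remaining indices form the reduced training set."""
    rng = np.random.default_rng(seed)
    val_idx = []
    for c in np.unique(labels):
        cls_idx = np.flatnonzero(labels == c)
        val_idx.extend(rng.choice(cls_idx, size=n_val_per_class, replace=False))
    val_idx = np.array(sorted(val_idx))
    train_idx = np.setdiff1d(np.arange(len(labels)), val_idx)
    return train_idx, val_idx

# CIFAR: one tenth of the samples per class (5,000 per class -> 500 held out).
# train_idx, val_idx = per_class_validation_split(cifar_labels, 500)
# SVHN: 400 per class from the training set plus 200 per class from the extra set.
# tr_idx, val_tr = per_class_validation_split(svhn_train_labels, 400)
# ex_idx, val_ex = per_class_validation_split(svhn_extra_labels, 200)
```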
Hardware Specification | No | The paper does not provide specific hardware details (exact GPU/CPU models, processor types, or memory amounts) used for running its experiments.
Software Dependencies | No | The paper mentions 'caffe' as part of the code repository URL, but it does not specify version numbers for Caffe or any other software dependencies.
Experiment Setup | Yes | Our baseline network has three stacks of convolutional layers, with each stack containing three convolutional layers, resulting in a total of nine layers. The stacks have [96-96-96], [128-128-128] and [256-256-512] filters, respectively. The kernel size is 3, padded by 1 pixel on each side, with stride 1 for all convolutional layers. At the end of each convolutional stack is a max-pooling operation with kernel and stride size of 2. The two fully connected layers have 2048 neurons each. We also apply dropout with ratio 0.5 after each fully connected layer. The final layer is a softmax classification layer. The optimal training hyperparameters are determined on each validation set. We set the momentum to 0.9 and the weight decay to 0.005. The base learning rate is set to 0.1, 0.1 and 0.05, respectively. We drop the learning rate by 10% around every 40 epochs in a continuous exponential way and stop decreasing the learning rate once it reaches a minimum value (0.0001). ... We use the hyperparameter K = 4 for the MBA module and a mini-batch size of 100 for stochastic gradient descent. All the convolutional layers are initialized from a Gaussian distribution with mean zero and standard deviation 0.05 or 0.1.
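
Taken together, the quoted setup describes a nine-layer convolutional baseline, an MBA module with K = 4 biases per feature map, and plain SGD. The sketch below is a rough PyTorch reconstruction of that configuration, not the authors' Caffe implementation linked above; the `MBA` layer, the helper names, and where the module would be inserted are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MBA(nn.Module):
    """Multi-Bias non-linear Activation (sketch): every input feature map is
    replicated K times, a separate learnable bias is added to each copy, and a
    ReLU is applied, so C input maps become K*C output maps."""
    def __init__(self, channels, k=4):
        super().__init__()
        self.k = k
        self.bias = nn.Parameter(torch.zeros(k * channels))

    def forward(self, x):
        x = x.repeat_interleave(self.k, dim=1)             # N x (K*C) x H x W
        return torch.relu(x + self.bias.view(1, -1, 1, 1))

def conv_stack(in_ch, widths):
    """One stack of three 3x3 convolutions (stride 1, pad 1) plus 2x2 max pooling."""
    layers, prev = [], in_ch
    for w in widths:
        layers += [nn.Conv2d(prev, w, kernel_size=3, padding=1), nn.ReLU(inplace=True)]
        prev = w
    layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
    return nn.Sequential(*layers), prev

# Baseline: stacks of [96-96-96], [128-128-128], [256-256-512] filters,
# followed by two fully connected layers of 2048 units with dropout 0.5.
stack1, c1 = conv_stack(3,  [96, 96, 96])
stack2, c2 = conv_stack(c1, [128, 128, 128])
stack3, c3 = conv_stack(c2, [256, 256, 512])
model = nn.Sequential(
    stack1, stack2, stack3,
    nn.Flatten(),                      # 32x32 input -> 4x4 maps after 3 poolings
    nn.Linear(c3 * 4 * 4, 2048), nn.ReLU(inplace=True), nn.Dropout(0.5),
    nn.Linear(2048, 2048), nn.ReLU(inplace=True), nn.Dropout(0.5),
    nn.Linear(2048, 10),               # softmax is folded into the loss
)

# Gaussian initialisation (the paper uses std 0.05 or 0.1, chosen per layer).
for m in model.modules():
    if isinstance(m, (nn.Conv2d, nn.Linear)):
        nn.init.normal_(m.weight, mean=0.0, std=0.05)
        nn.init.zeros_(m.bias)

# Quoted optimisation settings: SGD with momentum 0.9, weight decay 0.005,
# base learning rate 0.1 (or 0.05), mini-batch size 100, and an exponential
# learning-rate decay roughly every 40 epochs down to a floor of 1e-4.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=0.005)
criterion = nn.CrossEntropyLoss()
```

In an MBA variant of this baseline, selected ReLU activations would presumably be replaced by `MBA(channels, k=4)`, with the input channel count of the following convolution scaled by K accordingly; the exact placement is not specified in the quoted setup.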