Double Forward Propagation for Memorized Batch Normalization

Authors: Yong Guo, Qingyao Wu, Chaorui Deng, Jian Chen, Mingkui Tan

AAAI 2018 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this section, we evaluate the proposed MBN method in image classification tasks. We apply MBN on several well-known models, including VGG (Krizhevsky, Sutskever, and Hinton 2012) and ResNet (He et al. 2016).
Researcher Affiliation | Academia | Yong Guo, Qingyao Wu, Chaorui Deng, Jian Chen, Mingkui Tan, School of Software Engineering, South China University of Technology, China. {guo.yong, secrdyz}@mail.scut.edu.cn, {qyw, ellachen, mingkuitan}@scut.edu.cn
Pseudocode | Yes | Algorithm 1: Training MBN in Single Iteration. Require: recorded statistics in memory {μ_i}_{i=1}^{k}, {σ_i}_{i=1}^{k}; mean and variance of the current batch μ_B, σ_B; weights for the batches in memory {α_i}_{i=1}^{k}; learnable parameters γ, β. For convenience, μ_{k+1} = μ_B and σ_{k+1} = σ_B. (A code sketch of this procedure is given below the table.)
Open Source Code | No | The paper does not provide any statement or link for open-source code for the described methodology.
Open Datasets | Yes | In the experiments, three benchmark datasets are used: CIFAR-10, CIFAR-100 (Krizhevsky and Hinton 2009) and ImageNet (Russakovsky et al. 2015).
Dataset Splits | No | The paper specifies training and testing sample counts for datasets like CIFAR-10 ("5,000 training samples and 1,000 testing samples") but does not explicitly provide details on a separate validation dataset split.
Hardware Specification | Yes | All the experiments are conducted on a GPU server with one Titan X GPU.
Software Dependencies | No | The paper states "All compared models are implemented based on PyTorch" but does not specify a version number for PyTorch or other software dependencies.
Experiment Setup | Yes | Without special specification, we train the models through SGD with a mini-batch size of 128. The momentum for SGD is 0.9 and the weight decay is set to 10^-4. The learning rate is initialized as 0.1 and then divided by 10 at 40% and 60% of the training procedure, respectively. For MBN methods, we first set the parameter λ = 0.1 and then increase it to 0.5 and 0.9 at 40% and 60% of the training procedure, which is referred to as λ = {0.1, 0.5, 0.9}. The decaying parameter η in Eqn. (8) is set to 0.9. All the experiments are performed with 200 training epochs. (A PyTorch sketch of this schedule is given below the table.)
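
The pseudocode row above describes normalizing the current batch with statistics aggregated over the current batch and k memorized batches. The sketch below is only an illustration of that idea, not the authors' released code: the helper name `mbn_normalize`, the simple convex-combination weighting, and the variance-combination step are assumptions, and the double forward propagation that refreshes the memorized statistics is omitted.

```python
import torch

def mbn_normalize(x, memory_means, memory_vars, alphas, gamma, beta, eps=1e-5):
    """Illustrative MBN forward step (hypothetical helper, not the authors' code).

    x:            (N, C) activations of the current mini-batch.
    memory_means: list of k per-channel means recorded from earlier batches.
    memory_vars:  list of k per-channel variances recorded from earlier batches.
    alphas:       k+1 weights for the memorized batches plus the current batch.
    gamma, beta:  learnable affine parameters of shape (C,).
    """
    mu_b = x.mean(dim=0)                      # current-batch mean  (mu_{k+1})
    var_b = x.var(dim=0, unbiased=False)      # current-batch variance (sigma_{k+1})
    means = list(memory_means) + [mu_b]
    vars_ = list(memory_vars) + [var_b]

    # Weighted aggregation of statistics over memory + current batch
    # (a simple convex combination is assumed here for illustration).
    mu = sum(a * m for a, m in zip(alphas, means))
    var = sum(a * (v + (m - mu) ** 2) for a, v, m in zip(alphas, vars_, means))

    x_hat = (x - mu) / torch.sqrt(var + eps)  # normalize with aggregated statistics
    return gamma * x_hat + beta               # learnable scale and shift
```

In a full MBN layer, the memorized means and variances would be refreshed (with the decaying weight η from Eqn. (8)) before this normalization is applied; that bookkeeping is left out here for brevity.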
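
The experiment-setup row maps onto a standard PyTorch training loop. The sketch below assumes `model` is a network with MBN layers and `train_loader` is a CIFAR DataLoader; the `set_lambda` hook used to pass λ to the MBN layers is a hypothetical placeholder, since the paper does not release code.

```python
import torch
import torch.nn.functional as F

def train(model, train_loader, epochs=200):
    # SGD with momentum 0.9, weight decay 1e-4, initial learning rate 0.1.
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                                momentum=0.9, weight_decay=1e-4)
    # Learning rate divided by 10 at 40% and 60% of the training procedure.
    milestones = [int(0.4 * epochs), int(0.6 * epochs)]
    scheduler = torch.optim.lr_scheduler.MultiStepLR(
        optimizer, milestones=milestones, gamma=0.1)

    for epoch in range(epochs):
        # MBN lambda schedule {0.1, 0.5, 0.9}, stepped at the same milestones.
        lam = 0.1 if epoch < milestones[0] else (0.5 if epoch < milestones[1] else 0.9)
        if hasattr(model, "set_lambda"):      # hypothetical hook on the MBN layers
            model.set_lambda(lam)

        for images, targets in train_loader:
            optimizer.zero_grad()
            loss = F.cross_entropy(model(images), targets)
            loss.backward()
            optimizer.step()
        scheduler.step()
```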