Large-Margin Softmax Loss for Convolutional Neural Networks

Authors: Weiyang Liu, Yandong Wen, Zhiding Yu, Meng Yang

ICML 2016 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | "Extensive experiments on four benchmark datasets demonstrate that the deeply-learned features with L-Softmax loss become more discriminative, hence significantly boosting the performance on a variety of visual classification and verification tasks." |
| Researcher Affiliation | Academia | Weiyang Liu (WYLIU@PKU.EDU.CN, School of ECE, Peking University); Yandong Wen (WEN.YANDONG@MAIL.SCUT.EDU.CN, School of EIE, South China University of Technology); Zhiding Yu (YZHIDING@ANDREW.CMU.EDU, Dept. of ECE, Carnegie Mellon University); Meng Yang (YANG.MENG@SZU.EDU.CN, College of CS & SE, Shenzhen University) |
| Pseudocode | No | The paper provides mathematical derivations and discusses forward/backward propagation for the L-Softmax loss, but it does not include explicitly labeled pseudocode or algorithm blocks. (A sketch of the loss computation is given after this table.) |
| Open Source Code | No | The paper does not contain any explicit statement about releasing source code or a link to a code repository for the described methodology. |
| Open Datasets | Yes | "We evaluate the generalized softmax loss in two typical vision applications: visual classification and face verification. In visual classification, we use three standard benchmark datasets: MNIST (LeCun et al., 1998), CIFAR10 (Krizhevsky, 2009), and CIFAR100 (Krizhevsky, 2009). In face verification, we evaluate our method on the widely used LFW dataset (Huang et al., 2007). ... we train on the publicly available CASIA-WebFace (Yi et al., 2014) outside dataset" |
| Dataset Splits | Yes | "We start with a learning rate of 0.1, divide it by 10 at 12k and 15k iterations, and eventually terminate training at 18k iterations, which is determined on a 45k/5k train/val split." |
| Hardware Specification | No | The paper does not provide specific details about the hardware used to run the experiments, such as GPU or CPU models. |
| Software Dependencies | No | "We implement the CNNs using the Caffe library (Jia et al., 2014) with our modifications." (No specific version number for Caffe or any other software dependency is provided.) |
| Experiment Setup | Yes | "Our CNN architectures are described in Table 1. In convolution layers, the stride is set to 1 if not specified. We implement the CNNs using the Caffe library (Jia et al., 2014) with our modifications. For all experiments, we adopt the PReLU (He et al., 2015b) as the activation functions, and the batch size is 256. We use a weight decay of 0.0005 and momentum of 0.9. The weight initialization in (He et al., 2015b) and batch normalization (Ioffe & Szegedy, 2015) are used in our networks but without dropout. Note that we only perform the mean subtraction preprocessing for training and testing data. For optimization, normally the stochastic gradient descent will work well. ... We start with a learning rate of 0.1, divide it by 10 at 12k and 15k iterations, and eventually terminate training at 18k iterations, which is determined on a 45k/5k train/val split. ... The learning rate is set to 0.1, 0.01, 0.001 and is switched when the training loss plateaus. The total number of epochs is about 30 for our models." (A training-setup sketch based on these settings follows the table.) |
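
Since the paper states the L-Softmax loss only through equations and prose rather than a pseudocode block (see the Pseudocode row), the following is a minimal NumPy sketch of the forward computation under the paper's bias-free formulation, in which the target-class logit ||W_y|| ||x|| cos(theta_y) is replaced by ||W_y|| ||x|| psi(theta_y), with psi(theta) = (-1)^k cos(m*theta) - 2k on the interval [k*pi/m, (k+1)*pi/m]. The function names, the epsilon, and the arccos/cos route (the paper instead expands cos(m*theta) in powers of cos(theta)) are my own simplifications, and the backward pass is left to autodiff or to the paper's derivations.

```python
import numpy as np

def l_softmax_logits(X, W, y, m=4):
    """Sketch of the large-margin softmax forward pass.

    X: (N, D) features, W: (D, C) last-layer weights (no bias),
    y: (N,) integer labels, m: integer angular margin.
    """
    w_norm = np.linalg.norm(W, axis=0)            # ||W_j|| per class, shape (C,)
    x_norm = np.linalg.norm(X, axis=1)            # ||x_i|| per sample, shape (N,)
    logits = X @ W                                # ||W_j|| ||x_i|| cos(theta_ij), shape (N, C)

    n = np.arange(len(y))
    cos_t = logits[n, y] / (w_norm[y] * x_norm + 1e-12)
    theta = np.arccos(np.clip(cos_t, -1.0, 1.0))  # angle to the target-class weight
    k = np.floor(m * theta / np.pi)               # index of the [k*pi/m, (k+1)*pi/m] interval
    psi = (1.0 - 2.0 * (k % 2)) * np.cos(m * theta) - 2.0 * k   # (-1)^k cos(m*theta) - 2k
    logits[n, y] = w_norm[y] * x_norm * psi       # margin applied to the target class only
    return logits

def l_softmax_loss(X, W, y, m=4):
    """Cross-entropy over the margin-adjusted logits."""
    logits = l_softmax_logits(X, W, y, m)
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(y)), y].mean()
```

For m = 1, psi(theta) reduces to cos(theta) and the expression falls back to the standard softmax cross-entropy, matching the paper's observation that the original softmax loss is the special case m = 1.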
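
The Experiment Setup row pins down the optimizer settings for the classification experiments: SGD with momentum 0.9, weight decay 0.0005, batch size 256, and a learning rate of 0.1 divided by 10 at 12k and 15k iterations, with training stopped at 18k iterations. The paper implements this in a modified Caffe; the snippet below is only a hypothetical PyTorch restatement of those numbers around a placeholder model, with random tensors standing in for a CIFAR loader, not the authors' configuration.

```python
import torch
from torch import nn, optim

# Placeholder network; the paper's actual CNN architectures are given in its Table 1.
model = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, padding=1), nn.PReLU(),   # PReLU activations, as quoted
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, 10),
)

# Hyperparameters quoted in the Experiment Setup row above.
optimizer = optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=0.0005)
# Divide the learning rate by 10 at 12k and 15k iterations; stop at 18k.
scheduler = optim.lr_scheduler.MultiStepLR(optimizer, milestones=[12000, 15000], gamma=0.1)
criterion = nn.CrossEntropyLoss()   # stand-in; the paper replaces this with the L-Softmax loss

batch_size, max_iters = 256, 18000
for it in range(max_iters):
    # Random batch standing in for a CIFAR loader (3x32x32 images, 10 classes).
    images = torch.randn(batch_size, 3, 32, 32)
    labels = torch.randint(0, 10, (batch_size,))
    loss = criterion(model(images), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()   # scheduler stepped per iteration to match the 12k/15k milestones
```

The second schedule quoted in that row (learning rate switched through 0.1, 0.01, 0.001 at plateaus, for about 30 epochs) appears to describe the face-verification training on CASIA-WebFace rather than the 18k-iteration classification runs sketched here.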