Smooth Loss Functions for Deep Top-k Classification

Authors: Leonard Berrada, Andrew Zisserman, M. Pawan Kumar

ICLR 2018

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Our empirical evidence suggests that the loss function must be smooth and have non-sparse gradients in order to work well with deep neural networks. Consequently, we introduce a family of smoothed loss functions that are suited to top-k optimization via deep learning. We compare the performance of the cross-entropy loss and our margin-based losses in various regimes of noise and data size, for the predominant use case of k = 5. Our investigation reveals that our loss is more robust to noise and overfitting than cross-entropy." (An illustrative sketch of such a smoothed top-k loss appears below the table.)
Researcher Affiliation | Academia | Leonard Berrada (1), Andrew Zisserman (1) and M. Pawan Kumar (1,2); (1) Department of Engineering Science, University of Oxford; (2) Alan Turing Institute; {lberrada,az,pawan}@robots.ox.ac.uk
Pseudocode | Yes | Algorithm 1 (Forward Pass), Algorithm 2 (Backward Pass), Algorithm 3 (Summation Algorithm). (The summation idea is sketched below the table.)
Open Source Code | Yes | "The algorithms are implemented in Pytorch (Paszke et al., 2017) and are publicly available at https://github.com/oval-group/smooth-topk."
Open Datasets | Yes | "We introduce label noise in the CIFAR-100 data set (Krizhevsky, 2009) in a manner that would not perturb the top-5 error of a perfect classifier. For the latter [the study of data size], we vary the training data size on subsets of the ImageNet data set (Russakovsky et al., 2015)." (A possible noise scheme with this property is sketched below the table.)
Dataset Splits | Yes | "Then we use the model with the best top-5 validation accuracy and report its performance on the test set. Out of the 1.28 million training samples, we use subsets of various sizes and always hold out a balanced validation set of 50,000 images." (A generic balanced-holdout sketch appears below the table.)
Hardware Specification | Yes | "Experiments on CIFAR-100 and ImageNet are performed on respectively one and two Nvidia Titan Xp cards."
Software Dependencies | No | "The algorithms are implemented in Pytorch (Paszke et al., 2017)." However, no specific version numbers for PyTorch or any other software dependencies are provided.
Experiment Setup | Yes | "We use the architecture DenseNet 40-40 from Huang et al. (2017), and we use the same hyper-parameters and learning rate schedule as in Huang et al. (2017). The temperature parameter is fixed to one." "In all the following experiments, we train a ResNet-18 (He et al., 2016), adapting the protocol of the ImageNet experiment in Huang et al. (2017). In more detail, we optimize the model with Stochastic Gradient Descent with a batch size of 256, for a total of 120 epochs. We use a Nesterov momentum of 0.9. The temperature is set to 0.1 for the SVM loss... The learning rate is divided by ten at epochs 30, 60 and 90, and is set to an initial value of 0.1 for CE and 1 for L5,0.1. The quadratic regularization hyper-parameter is set to 0.0001 for CE. For L5,0.1, it is set to 0.000025 to preserve a similar relative weighting of the loss and the regularizer."
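
The abstract quoted in the Research Type row describes a smoothed, margin-based top-k loss with non-sparse gradients. As a minimal sketch of that idea, assuming the smoothing takes the usual log-sum-exp-over-k-subsets form with a temperature tau (the function name and the brute-force enumeration are illustrative only, not the authors' implementation from the repository listed in the Open Source Code row):

```python
# Minimal sketch of a smoothed top-5 margin loss, assuming a log-sum-exp smoothing of the
# top-k hinge over k-subsets (temperature tau). Brute-force enumeration keeps the structure
# explicit; it is not the authors' efficient implementation.
import itertools
import torch

def smooth_topk_loss(scores, label, k=5, tau=1.0):
    """scores: 1-D tensor of class scores; label: integer index of the ground-truth class."""
    n = scores.numel()
    all_terms, pos_terms = [], []
    for subset in itertools.combinations(range(n), k):
        avg = scores[list(subset)].mean()            # (1/k) * sum of the subset's scores
        margin = 0.0 if label in subset else 1.0     # margin term: 0 when y is inside the subset
        all_terms.append((margin + avg) / tau)
        if label in subset:
            pos_terms.append(avg / tau)
    # Replace the two hard maxima of the top-k hinge by tau * logsumexp(. / tau).
    return tau * (torch.logsumexp(torch.stack(all_terms), dim=0)
                  - torch.logsumexp(torch.stack(pos_terms), dim=0))

scores = torch.randn(10, requires_grad=True)         # 10 classes keeps the enumeration tiny
loss = smooth_topk_loss(scores, label=3)
loss.backward()                                       # the gradient touches every class
```

Replacing the hard maxima with soft maxima is what makes the loss smooth and spreads the gradient over all classes, which is the non-sparsity property the abstract refers to.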
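
The Pseudocode row lists the paper's Algorithm 1 (Forward Pass), Algorithm 2 (Backward Pass) and Algorithm 3 (Summation Algorithm), which are not reproduced here. As background for why the subset sums in a loss of the above form need not be enumerated, note that the sum over all k-subsets A of exp((1/(k*tau)) * sum_{j in A} s_j) equals the elementary symmetric polynomial e_k evaluated at x_j = exp(s_j / (k*tau)). The standard O(nk) recursion below computes e_k; it is only a stand-in for the paper's numerically stable summation algorithm, not a reimplementation of it.

```python
# Standard O(n*k) recursion for the elementary symmetric polynomials e_0..e_k.
# Simple stand-in to show the k-subset sums are tractable; the paper's Algorithm 3
# computes this kind of quantity with a numerically stable scheme.
import torch

def elementary_symmetric(x, k):
    """Return the tensor [e_0(x), ..., e_k(x)] for a 1-D tensor x."""
    e = [torch.ones((), dtype=x.dtype)] + [torch.zeros((), dtype=x.dtype) for _ in range(k)]
    for xi in x:
        for j in range(k, 0, -1):        # descending j so each x_i contributes at most once per term
            e[j] = e[j] + xi * e[j - 1]
    return torch.stack(e)

k, tau = 5, 1.0
s = torch.randn(10)                      # class scores
x = torch.exp(s / (k * tau))
# log e_k(x) equals the log-sum-exp over all k-subsets A of (1/(k*tau)) * sum_{j in A} s_j.
print(torch.log(elementary_symmetric(x, k)[k]))
```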
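
The Open Datasets row quotes a label-noise scheme that leaves the top-5 error of a perfect classifier unchanged. On CIFAR-100, each of the 20 coarse classes contains exactly 5 fine classes, so one scheme with this property is to resample corrupted fine labels uniformly within their coarse class. The sketch below assumes that scheme; the paper's exact procedure may differ.

```python
# Hedged sketch: resample a fraction of CIFAR-100 fine labels uniformly within their coarse
# class, so the true label always stays among 5 known candidates (the top-5 error of a
# perfect classifier is unaffected). The paper's exact noise procedure may differ.
import random

def corrupt_fine_labels(fine_labels, fine_to_coarse, noise_prob, seed=0):
    """fine_labels: list of fine-class ids; fine_to_coarse: dict mapping fine id -> coarse id."""
    rng = random.Random(seed)
    coarse_to_fine = {}
    for fine, coarse in fine_to_coarse.items():
        coarse_to_fine.setdefault(coarse, []).append(fine)   # 5 fine classes per coarse class
    noisy = []
    for y in fine_labels:
        if rng.random() < noise_prob:
            noisy.append(rng.choice(coarse_to_fine[fine_to_coarse[y]]))   # resample within coarse class
        else:
            noisy.append(y)
    return noisy
```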
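
The Dataset Splits row describes holding out a balanced validation set of 50,000 images from the 1.28 million ImageNet training samples, i.e. 50 images per class over the 1,000 classes. A generic per-class holdout such as the following (not the authors' split code) realizes this:

```python
# Generic balanced holdout: an equal number of validation images per class
# (50,000 / 1,000 classes = 50 per class for ImageNet). Not the authors' split code.
import random
from collections import defaultdict

def balanced_holdout(labels, holdout_size=50000, seed=0):
    """labels: list of class ids, one per training image. Returns (train_idx, val_idx)."""
    classes = sorted(set(labels))
    per_class = holdout_size // len(classes)
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    val_idx = []
    for y in classes:
        idxs = by_class[y]
        rng.shuffle(idxs)
        val_idx.extend(idxs[:per_class])
    held_out = set(val_idx)
    train_idx = [i for i in range(len(labels)) if i not in held_out]
    return train_idx, val_idx
```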
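
The ImageNet protocol quoted in the Experiment Setup row maps onto a standard PyTorch optimizer and step schedule. The sketch below plugs in the stated values for the cross-entropy run, with the L5,0.1 values (the smooth top-5 loss at temperature 0.1) noted in comments; the torchvision ResNet-18 and the training loop are placeholders, not the authors' training script.

```python
# Optimization setup from the quoted text, for the cross-entropy run. Model and data
# pipeline are placeholders; this is not the authors' training script.
import torch
from torchvision.models import resnet18

model = resnet18(num_classes=1000)                     # ResNet-18, as in the quoted setup

# Cross-entropy: initial lr 0.1, quadratic regularization (weight decay) 1e-4.
# Smooth top-5 loss L5,0.1: initial lr 1.0 and weight decay 2.5e-5 instead.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9,
                            nesterov=True, weight_decay=1e-4)

# Learning rate divided by ten at epochs 30, 60 and 90, over 120 epochs in total.
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[30, 60, 90], gamma=0.1)

for epoch in range(120):
    # ... one epoch of SGD with batch size 256 goes here ...
    scheduler.step()
```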