Robust Bi-Tempered Logistic Loss Based on Bregman Divergences

Authors: Ehsan Amid, Manfred K. Warmuth, Rohan Anil, Tomer Koren

NeurIPS 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We visualize the effect of tuning the two temperatures in a simple setting and show the efficacy of our method on large datasets. Our methodology is based on Bregman divergences and is superior to a related two-temperature method that uses the Tsallis divergence. We perform experiments by adding synthetic label noise to MNIST and CIFAR-100 datasets and compare the results of our robust bi-tempered loss to the vanilla logistic loss.
Researcher Affiliation | Collaboration | Department of Computer Science, University of California, Santa Cruz; School of Computer Science, Tel Aviv University, Tel Aviv, Israel; Google Brain. {eamid,manfred,rohananil,tkoren}@google.com
Pseudocode | Yes | Algorithm 1: Computing λ_t(a) for t > 1 (Fixed-Point Iteration). (A sketch of this iteration is given below the table.)
Open Source Code | Yes | An implementation of the bi-tempered logistic loss is available online at: https://github.com/google/bi-tempered-loss. (An illustrative sketch of the loss itself also appears below the table.)
Open Datasets | Yes | For moderate-size experiments, we use the MNIST dataset of handwritten digits [14] and CIFAR-100, which contains real-world images from 100 different classes [13]. We use ImageNet-2012 [6], which has 1000 classes, for large-scale image classification.
Dataset Splits | Yes | For both experiments, we report the test accuracy of the checkpoint which yields the highest accuracy on an identically label-noise corrupted validation set.
Hardware Specification | Yes | We use P100 GPUs for small-scale experiments and Cloud TPU-v2 for the larger-scale ImageNet-2012 experiments.
Software Dependencies | No | The paper mentions using
Experiment Setup | Yes | For MNIST, we use a CNN with two convolutional layers of size 32 and 64 with a mask size of 5, followed by two fully-connected layers of size 1024 and 10. We apply max-pooling after each convolutional layer with a window size equal to 2 and use dropout during training with keep probability equal to 0.75. We use the AdaDelta optimizer [21] with 500 epochs and a batch size of 128 for training. For CIFAR-100, we use a ResNet-56 [10] model without batch norm from [9] with the SGD + momentum optimizer, trained for 50k steps with a batch size of 128 and the standard staircase learning rate decay schedule. For ImageNet-2012, all experiments were trained for 180 epochs and use the SGD + momentum optimizer with a staircase learning rate decay schedule. (A sketch of the described MNIST model appears below the table.)
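
Algorithm 1 in the paper (the Pseudocode row above) computes the normalization λ_t(a) of the heavy-tailed tempered softmax by a fixed-point iteration. The following is a minimal NumPy sketch of that iteration; the function names and the fixed iteration count are our own choices (the paper iterates until convergence), and the authors' repository linked above remains the authoritative implementation.

import numpy as np

def exp_t(x, t):
    """Tempered exponential: exp_t(x) = [1 + (1 - t) x]_+ ** (1 / (1 - t)); reduces to exp at t = 1."""
    if t == 1.0:
        return np.exp(x)
    return np.maximum(0.0, 1.0 + (1.0 - t) * x) ** (1.0 / (1.0 - t))

def log_t(x, t):
    """Tempered logarithm: log_t(x) = (x ** (1 - t) - 1) / (1 - t); reduces to log at t = 1."""
    if t == 1.0:
        return np.log(x)
    return (x ** (1.0 - t) - 1.0) / (1.0 - t)

def lambda_t(a, t, num_iters=20):
    """Fixed-point iteration for the normalizer lambda_t(a), t > 1,
    chosen so that exp_t(a - lambda_t(a)) sums to one."""
    mu = np.max(a)
    a_shifted = a - mu
    a_tilde = a_shifted
    for _ in range(num_iters):
        z = np.sum(exp_t(a_tilde, t))          # current partition value
        a_tilde = (z ** (1.0 - t)) * a_shifted  # rescale and iterate
    z = np.sum(exp_t(a_tilde, t))
    return -log_t(1.0 / z, t) + mu

def tempered_softmax(a, t, num_iters=20):
    """Heavy-tailed tempered softmax built from the normalizer above."""
    return exp_t(a - lambda_t(a, t, num_iters), t)

# Quick check: the tempered probabilities should sum to (approximately) one.
probs = tempered_softmax(np.array([2.0, 1.0, 0.1]), t=1.5)
print(probs, probs.sum())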
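
The linked repository contains the authors' TensorFlow implementation of the loss. Purely for illustration, here is a NumPy sketch of the two-temperature loss as defined in the paper (0 <= t1 < 1 gives boundedness, t2 > 1 gives heavy tails); it reuses exp_t, log_t, and tempered_softmax from the sketch above, and the small epsilon is our own numerical guard.

# Assumes exp_t, log_t, and tempered_softmax from the previous sketch are in scope.
import numpy as np

def bi_tempered_logistic_loss(activations, labels, t1, t2, num_iters=20, eps=1e-10):
    """Bi-tempered logistic loss for one example: a bounded tempered-log loss (t1 < 1)
    applied to the heavy-tailed tempered softmax (t2 > 1)."""
    probs = tempered_softmax(activations, t2, num_iters)
    loss = (labels * (log_t(labels + eps, t1) - log_t(probs, t1))
            - (labels ** (2.0 - t1) - probs ** (2.0 - t1)) / (2.0 - t1))
    return np.sum(loss)

# Example: one-hot label with mild tempering (t1 = 0.8, t2 = 1.2).
a = np.array([1.5, 0.3, -0.8])
y = np.array([1.0, 0.0, 0.0])
print(bi_tempered_logistic_loss(a, y, t1=0.8, t2=1.2))

With t1 = t2 = 1 the expression reduces to the ordinary softmax cross-entropy (vanilla logistic loss) that the paper uses as its baseline.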
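
As an illustration of the MNIST setup in the Experiment Setup row, here is a Keras sketch of the stated architecture. The framework, activation functions, padding, and dropout placement are our own assumptions; the description above fixes only the layer sizes, pooling window, dropout keep probability, optimizer, epoch count, and batch size, and the loss shown is a placeholder where the bi-tempered loss would be substituted.

import tensorflow as tf

def build_mnist_cnn(dropout_keep_prob=0.75):
    """Two 5x5 conv layers (32 and 64 filters), 2x2 max-pooling after each,
    then fully-connected layers of 1024 and 10 units, with dropout before the output.
    ReLU activations, 'same' padding, and the dropout position are assumptions."""
    return tf.keras.Sequential([
        tf.keras.layers.Conv2D(32, 5, padding="same", activation="relu",
                               input_shape=(28, 28, 1)),
        tf.keras.layers.MaxPool2D(2),
        tf.keras.layers.Conv2D(64, 5, padding="same", activation="relu"),
        tf.keras.layers.MaxPool2D(2),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(1024, activation="relu"),
        tf.keras.layers.Dropout(rate=1.0 - dropout_keep_prob),  # keep prob 0.75 -> drop rate 0.25
        tf.keras.layers.Dense(10),  # logits for the 10 digit classes
    ])

model = build_mnist_cnn()
model.compile(
    optimizer=tf.keras.optimizers.Adadelta(),
    # Placeholder loss; the paper compares the vanilla logistic loss against the bi-tempered loss.
    loss=tf.keras.losses.CategoricalCrossentropy(from_logits=True),
)
# Training regime from the description: batch size 128, 500 epochs, e.g.
# model.fit(x_train, y_train_one_hot, batch_size=128, epochs=500)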