Do We Need Zero Training Loss After Achieving Zero Training Error?

Authors: Takashi Ishida, Ikko Yamane, Tomoya Sakai, Gang Niu, Masashi Sugiyama

ICML 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We experimentally show that flooding improves performance and, as a byproduct, induces a double descent curve of the test loss.
Researcher Affiliation | Collaboration | The University of Tokyo, RIKEN, NEC Corporation.
Pseudocode | Yes | A minimal working example with a mini-batch in PyTorch (Paszke et al., 2019) is demonstrated below to show the additional one line of code:
    outputs = model(inputs)
    loss = criterion(outputs, labels)
    flood = (loss - b).abs() + b  # This is it!
    optimizer.zero_grad()
    flood.backward()
    optimizer.step()
Open Source Code | Yes | The implementation in this section and the next is based on PyTorch (Paszke et al., 2019) and demo code is available: https://github.com/takashiishida/flooding
Open Datasets | Yes | We use three types of synthetic data: Two Gaussians, Sinusoid (Nakkiran et al., 2019), and Spiral (Sugiyama, 2015). We use the following benchmark datasets: MNIST, Kuzushiji-MNIST, SVHN, CIFAR-10, and CIFAR-100.
Dataset Splits | Yes | The training, validation, and test sample sizes are 100, 100, and 20000, respectively. We split the original training dataset into training and validation data with a proportion of 80:20, except when we used data augmentation, in which case we used 85:15.
Hardware Specification | Yes | Experiments were carried out with NVIDIA GeForce GTX 1080 Ti, NVIDIA Quadro RTX 5000, and Intel Xeon Gold 6142.
Software Dependencies | No | The paper mentions using "PyTorch (Paszke et al., 2019)" but does not specify a version number for it or any other software dependency.
Experiment Setup | Yes | We train the network for 500 epochs with the logistic loss and the Adam (Kingma & Ba, 2015) optimizer with mini-batch size 100 and learning rate 0.001. The flood level is chosen from b ∈ {0, 0.01, 0.02, . . . , 0.50}. Stochastic gradient descent (Robbins & Monro, 1951) is used with learning rate 0.1 and momentum 0.9 for 500 epochs. We perform an exhaustive hyper-parameter search for the flood level with candidates from {0.00, 0.01, . . . , 0.10}.
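
The Pseudocode row above shows only the one-line flooding change. The snippet below is a minimal self-contained sketch of how that line could sit inside a full training loop under the setup quoted in the Experiment Setup row (Adam optimizer, learning rate 0.001, mini-batch size 100, logistic loss). The two-layer network and the random synthetic data are illustrative assumptions, not the paper's exact architectures or datasets.

    # Minimal sketch of flooding in a PyTorch training loop.
    # Only the flooding line and the quoted hyperparameters (Adam, lr=0.001,
    # batch size 100, logistic loss, 500 epochs) come from the rows above;
    # the model and data below are illustrative assumptions.
    import torch
    import torch.nn as nn
    from torch.utils.data import DataLoader, TensorDataset

    torch.manual_seed(0)

    # Illustrative 2-D synthetic binary classification data (assumption).
    X = torch.randn(100, 2)
    y = (X[:, 0] + X[:, 1] > 0).float().unsqueeze(1)
    loader = DataLoader(TensorDataset(X, y), batch_size=100, shuffle=True)

    model = nn.Sequential(nn.Linear(2, 500), nn.ReLU(), nn.Linear(500, 1))
    criterion = nn.BCEWithLogitsLoss()  # logistic loss for binary labels
    optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

    b = 0.05  # flood level; the paper selects this on validation data

    for epoch in range(500):
        for inputs, labels in loader:
            outputs = model(inputs)
            loss = criterion(outputs, labels)
            flood = (loss - b).abs() + b  # the flooding trick
            optimizer.zero_grad()
            flood.backward()
            optimizer.step()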
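The Experiment Setup row also describes an exhaustive validation-based search over flood levels {0.00, 0.01, . . . , 0.10}. Below is a hedged sketch of that selection loop; train_and_evaluate is a hypothetical helper assumed to train a fresh model with flood level b and return its validation accuracy, and is not part of the paper's released code.

    # Sketch of an exhaustive flood-level search, assuming a hypothetical
    # train_and_evaluate(b) helper that returns validation accuracy.
    def select_flood_level(train_and_evaluate, candidates=None):
        if candidates is None:
            candidates = [round(0.01 * i, 2) for i in range(11)]  # 0.00 ... 0.10
        best_b, best_acc = None, float("-inf")
        for b in candidates:
            acc = train_and_evaluate(b)
            if acc > best_acc:
                best_b, best_acc = b, acc
        return best_b, best_acc

As quoted above, the paper keeps the flood level that gives the best validation accuracy; the helper here simply makes that selection step explicit.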