Do We Need Zero Training Loss After Achieving Zero Training Error?

Authors: Takashi Ishida, Ikko Yamane, Tomoya Sakai, Gang Niu, Masashi Sugiyama

ICML 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We experimentally show that flooding improves performance and, as a byproduct, induces a double descent curve of the test loss.
Researcher Affiliation | Collaboration | The University of Tokyo, RIKEN, NEC Corporation.
Pseudocode | Yes | A minimal working example with a mini-batch in PyTorch (Paszke et al., 2019) is demonstrated below to show the additional one line of code:
    outputs = model(inputs)
    loss = criterion(outputs, labels)
    flood = (loss - b).abs() + b  # This is it!
    optimizer.zero_grad()
    flood.backward()
    optimizer.step()
Open Source Code | Yes | The implementation in this section and the next is based on PyTorch (Paszke et al., 2019) and demo code is available: https://github.com/takashiishida/flooding
Open Datasets | Yes | We use three types of synthetic data: Two Gaussians, Sinusoid (Nakkiran et al., 2019), and Spiral (Sugiyama, 2015). We use the following benchmark datasets: MNIST, Kuzushiji-MNIST, SVHN, CIFAR-10, and CIFAR-100.
Dataset Splits | Yes | The training, validation, and test sample sizes are 100, 100, and 20000, respectively. We split the original training dataset into training and validation data with a proportion of 80:20, except when we used data augmentation, in which case we used 85:15.
Hardware Specification | Yes | Experiments were carried out with NVIDIA GeForce GTX 1080 Ti, NVIDIA Quadro RTX 5000, and Intel Xeon Gold 6142.
Software Dependencies | No | The paper mentions using "PyTorch (Paszke et al., 2019)" but does not specify a version number for it or any other software dependency.
Experiment Setup | Yes | We train the network for 500 epochs with the logistic loss and the Adam (Kingma & Ba, 2015) optimizer with mini-batch size 100 and learning rate 0.001. The flood level is chosen from b ∈ {0, 0.01, 0.02, . . . , 0.50}. Stochastic gradient descent (Robbins & Monro, 1951) is used with learning rate 0.1 and momentum 0.9 for 500 epochs. We perform an exhaustive hyper-parameter search for the flood level with candidates from {0.00, 0.01, . . . , 0.10}.
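
The Pseudocode row above shows only the one-line flooding change. The snippet below is a minimal self-contained sketch of how that line could sit inside a full training loop under the setup quoted in the Experiment Setup row (Adam optimizer, learning rate 0.001, mini-batch size 100, logistic loss). The two-layer network and the random synthetic data are illustrative assumptions, not the paper's exact architectures or datasets.

    # Minimal sketch of flooding in a PyTorch training loop.
    # Only the flooding line and the quoted hyperparameters (Adam, lr=0.001,
    # batch size 100, logistic loss, 500 epochs) come from the rows above;
    # the model and data below are illustrative assumptions.
    import torch
    import torch.nn as nn
    from torch.utils.data import DataLoader, TensorDataset

    torch.manual_seed(0)

    # Illustrative 2-D synthetic binary classification data (assumption).
    X = torch.randn(100, 2)
    y = (X[:, 0] + X[:, 1] > 0).float().unsqueeze(1)
    loader = DataLoader(TensorDataset(X, y), batch_size=100, shuffle=True)

    model = nn.Sequential(nn.Linear(2, 500), nn.ReLU(), nn.Linear(500, 1))
    criterion = nn.BCEWithLogitsLoss()  # logistic loss for binary labels
    optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

    b = 0.05  # flood level; the paper selects this on validation data

    for epoch in range(500):
        for inputs, labels in loader:
            outputs = model(inputs)
            loss = criterion(outputs, labels)
            flood = (loss - b).abs() + b  # the flooding trick
            optimizer.zero_grad()
            flood.backward()
            optimizer.step()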
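The Experiment Setup row also describes an exhaustive validation-based search over flood levels {0.00, 0.01, . . . , 0.10}. Below is a hedged sketch of that selection loop; train_and_evaluate is a hypothetical helper assumed to train a fresh model with flood level b and return its validation accuracy, and is not part of the paper's released code.

    # Sketch of an exhaustive flood-level search, assuming a hypothetical
    # train_and_evaluate(b) helper that returns validation accuracy.
    def select_flood_level(train_and_evaluate, candidates=None):
        if candidates is None:
            candidates = [round(0.01 * i, 2) for i in range(11)]  # 0.00 ... 0.10
        best_b, best_acc = None, float("-inf")
        for b in candidates:
            acc = train_and_evaluate(b)
            if acc > best_acc:
                best_b, best_acc = b, acc
        return best_b, best_acc

As quoted above, the paper keeps the flood level that gives the best validation accuracy; the helper here simply makes that selection step explicit.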