Do We Need Zero Training Loss After Achieving Zero Training Error?
Authors: Takashi Ishida, Ikko Yamane, Tomoya Sakai, Gang Niu, Masashi Sugiyama
ICML 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We experimentally show that flooding improves performance and, as a byproduct, induces a double descent curve of the test loss. |
| Researcher Affiliation | Collaboration | The University of Tokyo, RIKEN, NEC Corporation. |
| Pseudocode | Yes | A minimal working example with a mini-batch in PyTorch (Paszke et al., 2019) is demonstrated below to show the additional one line of code (a self-contained sketch follows this table): `outputs = model(inputs)` `loss = criterion(outputs, labels)` `flood = (loss-b).abs()+b # This is it!` `optimizer.zero_grad()` `flood.backward()` `optimizer.step()` |
| Open Source Code | Yes | The implementation in this section and the next is based on PyTorch (Paszke et al., 2019) and demo code is available at https://github.com/takashiishida/flooding |
| Open Datasets | Yes | Data: We use three types of synthetic data: Two Gaussians, Sinusoid (Nakkiran et al., 2019), and Spiral (Sugiyama, 2015). We use the following benchmark datasets: MNIST, Kuzushiji-MNIST, SVHN, CIFAR-10, and CIFAR-100. |
| Dataset Splits | Yes | The training, validation, and test sample sizes are 100, 100, and 20000, respectively. We split the original training dataset into training and validation data with a proportion of 80:20, except when we used data augmentation, in which case we used 85:15. |
| Hardware Specification | Yes | Experiments were carried out with NVIDIA GeForce GTX 1080 Ti, NVIDIA Quadro RTX 5000, and Intel Xeon Gold 6142. |
| Software Dependencies | No | The paper mentions using "PyTorch (Paszke et al., 2019)" but does not specify a version number for it or any other software dependency. |
| Experiment Setup | Yes | We train the network for 500 epochs with the logistic loss and the Adam (Kingma & Ba, 2015) optimizer with a mini-batch size of 100 and a learning rate of 0.001. The flood level is chosen from b ∈ {0, 0.01, . . . , 0.50}. Stochastic gradient descent (Robbins & Monro, 1951) is used with a learning rate of 0.1 and momentum of 0.9 for 500 epochs. We perform an exhaustive hyper-parameter search for the flood level with candidates from {0.00, 0.01, . . . , 0.10} (a sketch of this search is given after the table). |
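The one-line flooding modification quoted in the Pseudocode row can be made self-contained as below. The toy model, random mini-batch, cross-entropy criterion, and flood level `b = 0.03` are illustrative assumptions; only the flooding line, the Adam learning rate of 0.001, and the mini-batch size of 100 come from the table above.

```python
# Minimal runnable sketch of the flooding objective (assumed toy setup).
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 2))  # placeholder model
criterion = nn.CrossEntropyLoss()          # stand-in for the paper's logistic loss
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)  # lr from the setup row
b = 0.03                                   # flood level (hyper-parameter, assumed value)

inputs = torch.randn(100, 20)              # mini-batch size 100, as in the setup row
labels = torch.randint(0, 2, (100,))

outputs = model(inputs)
loss = criterion(outputs, labels)
flood = (loss - b).abs() + b               # the additional one line: flooding
optimizer.zero_grad()
flood.backward()
optimizer.step()
```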
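The Experiment Setup and Dataset Splits rows describe an exhaustive search for the flood level over {0.00, 0.01, . . . , 0.10} using an 80:20 training/validation split. Below is a minimal sketch of that search under assumed conditions: the toy data, small MLP, and shortened training schedule are placeholders, not the paper's benchmark configuration.

```python
# Sketch of the exhaustive flood-level search: train one model per candidate b
# and keep the b with the best validation accuracy (assumed toy setup).
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset, random_split

torch.manual_seed(0)
X = torch.randn(1000, 20)                  # toy features standing in for a benchmark dataset
y = (X[:, 0] > 0).long()                   # toy binary labels
dataset = TensorDataset(X, y)

# 80:20 training/validation split, as in the Dataset Splits row.
n_train = int(0.8 * len(dataset))
train_set, val_set = random_split(dataset, [n_train, len(dataset) - n_train])

def train_with_flooding(b, epochs=20):     # shortened schedule; the paper trains 500 epochs
    model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 2))
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=0.001)   # lr from the setup row
    loader = DataLoader(train_set, batch_size=100, shuffle=True) # mini-batch size 100
    for _ in range(epochs):
        for inputs, labels in loader:
            loss = criterion(model(inputs), labels)
            flood = (loss - b).abs() + b   # flooding objective
            optimizer.zero_grad()
            flood.backward()
            optimizer.step()
    return model

@torch.no_grad()
def val_accuracy(model):
    correct, total = 0, 0
    for inputs, labels in DataLoader(val_set, batch_size=100):
        correct += (model(inputs).argmax(dim=1) == labels).sum().item()
        total += labels.numel()
    return correct / total

# Exhaustive search over flood-level candidates {0.00, 0.01, ..., 0.10}.
results = {b / 100: val_accuracy(train_with_flooding(b / 100)) for b in range(11)}
best_b = max(results, key=results.get)
print(f"best flood level: {best_b:.2f} (validation accuracy {results[best_b]:.3f})")
```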