Obtaining Adjustable Regularization for Free via Iterate Averaging
Authors: Jingfeng Wu, Vladimir Braverman, Lin Yang
ICML 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirical studies on both synthetic and real datasets verify our theory. Moreover, we test iterate averaging with modern deep neural networks on CIFAR-10 and CIFAR-100 datasets, and the proposed approaches still obtain effective and adjustable regularization effects with little additional computation, demonstrating the broad applicability of our methods. |
| Researcher Affiliation | Academia | ¹Johns Hopkins University, Baltimore, MD, USA; ²University of California, Los Angeles, CA, USA. |
| Pseudocode | No | The paper describes algorithms and update rules using mathematical equations (e.g., equations 1, 2, 3, 4, 5, 6), but it does not present any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | The code is available at https://github.com/uuujf/IterAvg. |
| Open Datasets | Yes | We then present experiments on the MNIST dataset. [...] We train VGG-16 (Simonyan & Zisserman, 2014) and ResNet-18 (He et al., 2016) on CIFAR-10 and CIFAR-100 datasets... |
| Dataset Splits | No | The paper mentions experiments on MNIST, CIFAR-10, and CIFAR-100, but it does not explicitly provide details about dataset splits, such as percentages or specific sample counts for training, validation, and testing. |
| Hardware Specification | Yes | The running times are measured by performing the experiments using a single K80 GPU. |
| Software Dependencies | No | The paper mentions the use of specific models like VGG-16 and ResNet-18, and standard tricks like batch normalization, but it does not provide specific version numbers for any software dependencies, libraries, or frameworks used (e.g., PyTorch version, TensorFlow version, CUDA version). |
| Experiment Setup | Yes | The models are trained for 300 epochs using SGD. We perform epoch averaging using the 240 checkpoints saved from the 61st to the 300th epoch. The first 60 epochs are skipped since the models in the early phase are extremely unstable. After averaging the parameters, we apply a trick proposed by Izmailov et al. (2018) to handle the batch normalization statistics which are not trained by SGD. Specifically, we make a forward pass on the training data to compute the activation statistics for the batch normalization layers. For the choice of averaging scheme, we test standard geometric distribution with success probability p ∈ {0.9999, 0.999, 0.99, 0.9}. |
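
The Experiment Setup row describes a concrete procedure: average 240 saved checkpoints with geometric weights, then refresh batch-norm statistics with one pass over the training data. Below is a minimal sketch of that procedure, assuming a PyTorch implementation with checkpoints stored as `state_dict` files; the file layout, helper names, and the convention that the weights decay from the most recent checkpoint backward are assumptions for illustration, not details given by the paper.

```python
# Minimal sketch (assumption: PyTorch) of epoch averaging with truncated geometric
# weights, followed by the batch-norm statistics refresh of Izmailov et al. (2018).
# Checkpoint paths, helper names, and the weighting direction are illustrative.
import torch


def geometric_weights(num_ckpts, p):
    """Truncated geometric weights with success probability p, renormalized to sum to 1.
    Index k = 0 is taken here to be the most recent checkpoint (an assumed convention)."""
    w = torch.tensor([p * (1.0 - p) ** k for k in range(num_ckpts)])
    return w / w.sum()


def average_checkpoints(ckpt_paths, weights):
    """Weighted average of model state dicts; non-float buffers (e.g. BN counters)
    are copied from the first checkpoint and recomputed later anyway."""
    avg = None
    for path, w in zip(ckpt_paths, weights):
        state = torch.load(path, map_location="cpu")
        if avg is None:
            avg = {k: (w * v if v.is_floating_point() else v.clone())
                   for k, v in state.items()}
        else:
            for k, v in state.items():
                if v.is_floating_point():
                    avg[k] += w * v
    return avg


@torch.no_grad()
def refresh_bn_statistics(model, train_loader, device="cpu"):
    """One forward pass over the training data to recompute batch-norm running
    statistics, which parameter averaging does not produce."""
    for m in model.modules():
        if isinstance(m, torch.nn.modules.batchnorm._BatchNorm):
            m.reset_running_stats()
    model.train()
    for x, _ in train_loader:
        model(x.to(device))


# Hypothetical usage matching the setup quoted above:
# ckpts = [f"ckpt_epoch_{e}.pt" for e in range(300, 60, -1)]   # 240 checkpoints, newest first
# model.load_state_dict(average_checkpoints(ckpts, geometric_weights(len(ckpts), 0.99)))
# refresh_bn_statistics(model, train_loader)
```

The weights are renormalized because the geometric distribution is truncated at the 240 available checkpoints, and whether the decay runs from the newest or the oldest checkpoint is not stated in the quoted excerpt, so the convention above is only one plausible reading.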