Unique Properties of Flat Minima in Deep Networks
Authors: Rotem Mulayoff, Tomer Michaeli
ICML 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our theoretical results apply to linear networks. Yet, as we now empirically illustrate, they also nicely capture the behavior of nonlinear networks. To show this, we trained fully connected networks with ReLU activation functions to denoise images of handwritten digits. We used the MNIST dataset (LeCun, 1998) and simulated zero-mean white Gaussian noise of standard deviation 1.25, where the pixel range of the clean images was [0, 1]. ... Figure 4 visualizes the result of training networks of different depths using varying step sizes. For each configuration, we measured the top eigenvalue of the Hessian using the power method. (A hedged power-method sketch is given below the table.) |
| Researcher Affiliation | Academia | Rotem Mulayoff, Tomer Michaeli; Department of Electrical Engineering, Technion - Israel Institute of Technology, Haifa, Israel. |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide concrete access to source code for the methodology described. |
| Open Datasets | Yes | We used the MNIST dataset (LeCun, 1998) and simulated zero-mean white Gaussian noise of standard deviation 1.25, where the pixel range of the clean images was [0, 1]. (A sketch of this noisy-data construction is given below the table.) |
| Dataset Splits | No | The paper mentions using the MNIST dataset but does not specify exact training, validation, or test dataset splits. |
| Hardware Specification | No | The paper does not provide specific hardware details (exact GPU/CPU models, processor types with speeds, memory amounts, or detailed computer specifications) used for running its experiments. |
| Software Dependencies | No | The paper mentions software components such as ReLU activation functions, SGD, and Adam, but it does not list any software dependencies with version numbers (e.g., Python 3.8, PyTorch 1.9). |
| Experiment Setup | Yes | We minimized the quadratic loss using SGD without momentum. ... We trained a six-layer network for the same denoising problem as above, using two different optimization methods: (i) SGD with a large step size and moderate batch size, a configuration that is known to converge to flat minima (Keskar et al., 2016); (ii) Adam (Kingma & Ba, 2014) with a small step size, which can converge to sharp minima (Wu et al., 2018). ... In the experiment above, we used identity initialization, as Lemma 1 suggests this should lead to a flat minimum. To verify that this is indeed the case, we repeated the experiment with the initialization of He et al. (2015). (A sketch of these two training configurations is given below the table.) |
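
The denoising data described in the Open Datasets row can be reconstructed along these lines. This is a minimal sketch, not the authors' code: it assumes PyTorch and torchvision are available, and the batch size and variable names are illustrative.

```python
# Minimal sketch (assumed setup, not the authors' code): clean MNIST digits
# in [0, 1] corrupted by zero-mean white Gaussian noise of std 1.25.
import torch
from torchvision import datasets, transforms

NOISE_STD = 1.25  # noise level reported in the paper

train_set = datasets.MNIST(root="./data", train=True, download=True,
                           transform=transforms.ToTensor())  # pixels in [0, 1]
loader = torch.utils.data.DataLoader(train_set, batch_size=128, shuffle=True)

for clean, _ in loader:
    clean = clean.view(clean.size(0), -1)               # flatten 28x28 images
    noisy = clean + NOISE_STD * torch.randn_like(clean)
    # `noisy` is the network input, `clean` is the regression target.
    break
```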
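
The Research Type row mentions measuring the top eigenvalue of the Hessian with the power method. The sketch below shows one standard way to do this with Hessian-vector products in PyTorch; the function name, iteration count, and tolerance are assumptions, not details taken from the paper.

```python
# Power-method estimate of the largest Hessian eigenvalue of a loss,
# using Hessian-vector products via torch.autograd.grad.
import torch

def top_hessian_eigenvalue(loss, parameters, num_iters=100, tol=1e-6):
    params = [p for p in parameters if p.requires_grad]
    grads = torch.autograd.grad(loss, params, create_graph=True)

    # Random unit-norm starting direction.
    v = [torch.randn_like(p) for p in params]
    norm = torch.sqrt(sum((u ** 2).sum() for u in v))
    v = [u / norm for u in v]

    eigval = 0.0
    for _ in range(num_iters):
        # Hessian-vector product: differentiate <grad, v> w.r.t. the parameters.
        dot = sum((g * u).sum() for g, u in zip(grads, v))
        hv = torch.autograd.grad(dot, params, retain_graph=True)

        # Rayleigh quotient v^T H v (v has unit norm) as the eigenvalue estimate.
        new_eigval = sum((h * u).sum() for h, u in zip(hv, v)).item()
        norm = torch.sqrt(sum((h ** 2).sum() for h in hv))
        v = [h / norm for h in hv]

        if abs(new_eigval - eigval) <= tol * max(abs(eigval), 1e-12):
            eigval = new_eigval
            break
        eigval = new_eigval
    return eigval

# Usage (with `net`, `noisy`, `clean` as in the data sketch above):
#   loss = torch.nn.functional.mse_loss(net(noisy), clean)
#   sharpness = top_hessian_eigenvalue(loss, net.parameters())
```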
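
The Experiment Setup row contrasts SGD (no momentum, large step size) with Adam (small step size), both starting from an identity initialization. Below is a hedged sketch of that setup; the network width and the concrete learning rates are illustrative assumptions rather than values taken from the paper.

```python
# Sketch of the two training configurations described in the quoted setup:
# (i) plain SGD with a large step size, (ii) Adam with a small step size,
# both minimizing the quadratic (MSE) denoising loss from an identity init.
import torch
import torch.nn as nn

DEPTH, WIDTH = 6, 784  # six-layer fully connected ReLU net; width is assumed

def make_net():
    layers = []
    for i in range(DEPTH):
        lin = nn.Linear(WIDTH, WIDTH)
        nn.init.eye_(lin.weight)       # identity initialization (cf. Lemma 1)
        nn.init.zeros_(lin.bias)
        layers.append(lin)
        if i < DEPTH - 1:
            layers.append(nn.ReLU())
    return nn.Sequential(*layers)

criterion = nn.MSELoss()  # quadratic loss

# (i) SGD without momentum and a large step size (tends toward flat minima).
net_sgd = make_net()
opt_sgd = torch.optim.SGD(net_sgd.parameters(), lr=0.1)

# (ii) Adam with a small step size (can converge to sharp minima).
net_adam = make_net()
opt_adam = torch.optim.Adam(net_adam.parameters(), lr=1e-4)

def train_step(net, opt, noisy, clean):
    opt.zero_grad()
    loss = criterion(net(noisy), clean)
    loss.backward()
    opt.step()
    return loss.item()
```

For the alternative initialization of He et al. (2015) mentioned in the quote, `nn.init.eye_` would be replaced by a Kaiming initializer such as `nn.init.kaiming_normal_`.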