Unique Properties of Flat Minima in Deep Networks
Authors: Rotem Mulayoff, Tomer Michaeli
ICML 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our theoretical results apply to linear networks. Yet, as we now empirically illustrate, they also nicely capture the behavior of nonlinear networks. To show this, we trained fully connected networks with ReLU activation functions to denoise images of handwritten digits. We used the MNIST dataset (LeCun, 1998) and simulated zero-mean white Gaussian noise of standard deviation 1.25, where the pixel range of the clean images was [0, 1]. ... Figure 4 visualizes the result of training networks of different depths using varying step sizes. For each configuration, we measured the top eigenvalue of the Hessian using the power method. (A hedged power-method sketch is given below the table.) |
| Researcher Affiliation | Academia | Rotem Mulayoff, Tomer Michaeli; Department of Electrical Engineering, Technion - Israel Institute of Technology, Haifa, Israel. |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide concrete access to source code for the methodology described. |
| Open Datasets | Yes | We used the MNIST dataset (LeCun, 1998) and simulated zero-mean white Gaussian noise of standard deviation 1.25, where the pixel range of the clean images was [0, 1]. (A sketch of this noisy-data construction is given below the table.) |
| Dataset Splits | No | The paper mentions using the MNIST dataset but does not specify exact training, validation, or test dataset splits. |
| Hardware Specification | No | The paper does not provide specific hardware details (exact GPU/CPU models, processor types with speeds, memory amounts, or detailed computer specifications) used for running its experiments. |
| Software Dependencies | No | The paper mentions software components such as ReLU activation functions, SGD, and Adam, but it does not list any software dependencies with version numbers (e.g., Python 3.8, PyTorch 1.9). |
| Experiment Setup | Yes | We minimized the quadratic loss using SGD without momentum. ... We trained a six-layer network for the same denoising problem as above, using two different optimization methods: (i) SGD with a large step size and moderate batch size, a configuration that is known to converge to flat minima (Keskar et al., 2016); (ii) Adam (Kingma & Ba, 2014) with a small step size, which can converge to sharp minima (Wu et al., 2018). ... In the experiment above, we used identity initialization, as Lemma 1 suggests this should lead to a flat minimum. To verify that this is indeed the case, we repeated the experiment with the initialization of He et al. (2015). (A sketch of these two training configurations is given below the table.) |
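
The denoising data described in the Open Datasets row can be reconstructed along these lines. This is a minimal sketch, not the authors' code: it assumes PyTorch and torchvision are available, and the batch size and variable names are illustrative.

```python
# Minimal sketch (assumed setup, not the authors' code): clean MNIST digits
# in [0, 1] corrupted by zero-mean white Gaussian noise of std 1.25.
import torch
from torchvision import datasets, transforms

NOISE_STD = 1.25  # noise level reported in the paper

train_set = datasets.MNIST(root="./data", train=True, download=True,
                           transform=transforms.ToTensor())  # pixels in [0, 1]
loader = torch.utils.data.DataLoader(train_set, batch_size=128, shuffle=True)

for clean, _ in loader:
    clean = clean.view(clean.size(0), -1)               # flatten 28x28 images
    noisy = clean + NOISE_STD * torch.randn_like(clean)
    # `noisy` is the network input, `clean` is the regression target.
    break
```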
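
The Research Type row mentions measuring the top eigenvalue of the Hessian with the power method. The sketch below shows one standard way to do this with Hessian-vector products in PyTorch; the function name, iteration count, and tolerance are assumptions, not details taken from the paper.

```python
# Power-method estimate of the largest Hessian eigenvalue of a loss,
# using Hessian-vector products via torch.autograd.grad.
import torch

def top_hessian_eigenvalue(loss, parameters, num_iters=100, tol=1e-6):
    params = [p for p in parameters if p.requires_grad]
    grads = torch.autograd.grad(loss, params, create_graph=True)

    # Random unit-norm starting direction.
    v = [torch.randn_like(p) for p in params]
    norm = torch.sqrt(sum((u ** 2).sum() for u in v))
    v = [u / norm for u in v]

    eigval = 0.0
    for _ in range(num_iters):
        # Hessian-vector product: differentiate <grad, v> w.r.t. the parameters.
        dot = sum((g * u).sum() for g, u in zip(grads, v))
        hv = torch.autograd.grad(dot, params, retain_graph=True)

        # Rayleigh quotient v^T H v (v has unit norm) as the eigenvalue estimate.
        new_eigval = sum((h * u).sum() for h, u in zip(hv, v)).item()
        norm = torch.sqrt(sum((h ** 2).sum() for h in hv))
        v = [h / norm for h in hv]

        if abs(new_eigval - eigval) <= tol * max(abs(eigval), 1e-12):
            eigval = new_eigval
            break
        eigval = new_eigval
    return eigval

# Usage (with `net`, `noisy`, `clean` as in the data sketch above):
#   loss = torch.nn.functional.mse_loss(net(noisy), clean)
#   sharpness = top_hessian_eigenvalue(loss, net.parameters())
```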
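
The Experiment Setup row contrasts SGD (no momentum, large step size) with Adam (small step size), both starting from an identity initialization. Below is a hedged sketch of that setup; the network width and the concrete learning rates are illustrative assumptions rather than values taken from the paper.

```python
# Sketch of the two training configurations described in the quoted setup:
# (i) plain SGD with a large step size, (ii) Adam with a small step size,
# both minimizing the quadratic (MSE) denoising loss from an identity init.
import torch
import torch.nn as nn

DEPTH, WIDTH = 6, 784  # six-layer fully connected ReLU net; width is assumed

def make_net():
    layers = []
    for i in range(DEPTH):
        lin = nn.Linear(WIDTH, WIDTH)
        nn.init.eye_(lin.weight)       # identity initialization (cf. Lemma 1)
        nn.init.zeros_(lin.bias)
        layers.append(lin)
        if i < DEPTH - 1:
            layers.append(nn.ReLU())
    return nn.Sequential(*layers)

criterion = nn.MSELoss()  # quadratic loss

# (i) SGD without momentum and a large step size (tends toward flat minima).
net_sgd = make_net()
opt_sgd = torch.optim.SGD(net_sgd.parameters(), lr=0.1)

# (ii) Adam with a small step size (can converge to sharp minima).
net_adam = make_net()
opt_adam = torch.optim.Adam(net_adam.parameters(), lr=1e-4)

def train_step(net, opt, noisy, clean):
    opt.zero_grad()
    loss = criterion(net(noisy), clean)
    loss.backward()
    opt.step()
    return loss.item()
```

For the alternative initialization of He et al. (2015) mentioned in the quote, `nn.init.eye_` would be replaced by a Kaiming initializer such as `nn.init.kaiming_normal_`.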