Entropic gradient descent algorithms and wide flat minima
Authors: Fabrizio Pittorino, Carlo Lucibello, Christoph Feinauer, Gabriele Perugini, Carlo Baldassi, Elizaveta Demyanenko, Riccardo Zecchina
ICLR 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We study two algorithms, Entropy-SGD and Replicated-SGD, that explicitly include the local entropy in the optimization objective. We devise a training schedule by which we consistently find flatter minima (using both flatness measures), and improve the generalization error for common architectures (e.g. ResNet, EfficientNet). In this section, we explore in detail the connection between the two flatness measures and the generalization properties in a one-hidden-layer network that performs a binary classification task, also called a committee machine. This model has a symmetry that allows all the weights in the last layer to be fixed to 1, so that only the first layer is trained. It is also invariant to rescaling of the weights. This allows its typical properties to be studied analytically with statistical mechanics techniques, and it was shown in Baldassi et al. (2020) that it has a rich non-convex error-loss landscape, in which rare flat minima coexist with narrower ones. It is amenable to semi-analytical study: for individual instances, the minimizers found by different algorithms can be compared by computing their local entropy efficiently with the Belief Propagation (BP) algorithm (see Appendix B.1), bypassing the need to perform the integral in Eq. (1) explicitly. Doing the same for general architectures is an open problem. For a network with K hidden units, the output predicted for a given input pattern x reads: (the formula is missing from the extracted excerpt; a hedged reconstruction is sketched after the table). We follow the numerical setting of Baldassi et al. (2020) and train this network to perform binary classification on two classes of the Fashion-MNIST dataset with binarized patterns, comparing the results of standard SGD with cross-entropy loss (CE) with the entropic counterparts rSGD and eSGD. In this section we show that, by optimizing the local entropy with eSGD and rSGD, we are able to systematically improve the generalization performance compared to standard SGD. We perform experiments on image classification tasks, using common benchmark datasets, state-of-the-art deep architectures and the usual cross-entropy loss. The detailed settings of the experiments are reported in the SM. For the experiments with eSGD and rSGD, we use the same settings and hyper-parameters (architecture, dropout, learning rate schedule, ...) as for the baseline, unless otherwise stated in the SM and apart from the hyper-parameters specific to these algorithms. |
| Researcher Affiliation | Academia | Fabrizio Pittorino1,2, Carlo Lucibello1, Christoph Feinauer1, Gabriele Perugini1, Carlo Baldassi1, Elizaveta Demyanenko1, Riccardo Zecchina1 1AI Lab, Institute for Data Science and Analytics, Bocconi University, 20136 Milano, Italy 2Dept. Applied Science and Technology, Politecnico di Torino, 10129 Torino, Italy |
| Pseudocode | Yes | Algorithm 1: Entropy-SGD (eSGD). Input: w. Hyper-parameters: L, η, γ, η′, ϵ, α. Algorithm 2: Replicated-SGD (rSGD). Input: {wᵃ}. Hyper-parameters: y, η, γ, K. (A hedged PyTorch sketch of the eSGD inner/outer update is given after the table.) |
| Open Source Code | Yes | While we hope to foster the application of entropic algorithms by publishing code that can be used to adapt them easily to new architectures, we also believe that the numeric results are important for theoretical research, since they are rooted in a well-defined geometric interpretation of the loss landscape. |
| Open Datasets | Yes | Fashion-MNIST dataset, CIFAR-10, CIFAR-100, Tiny ImageNet |
| Dataset Splits | No | The paper frequently mentions 'training set' and 'test set' and reports 'test set error' and 'train error difference'. However, it does not explicitly provide details for a 'validation set' or specific dataset splits that include a validation portion. |
| Hardware Specification | No | No specific hardware details (e.g., GPU/CPU models, memory amounts, or cloud instance types) used for running the experiments are provided in the paper. The paper only discusses software implementations and datasets. |
| Software Dependencies | Yes | All experiments are implemented using PyTorch (Paszke et al., 2019). Default parameter initialization (PyTorch 1.3). |
| Experiment Setup | Yes | The detailed settings of the experiments are reported in the SM. For the experiments with eSGD and rSGD, we use the same settings and hyper-parameters (architecture, dropout, learning rate schedule, ...) as for the baseline, unless otherwise stated in the SM and apart from the hyper-parameters specific to these algorithms. We used the following hyper-parameters for the various algorithms: SGD fast: η = 2·10⁻⁴, β0 = 2.0, β1 = 10⁻⁴, ω0 = 5.0, ω1 = 0.0; SGD slow: η = 3·10⁻⁵, β0 = 0.5, β1 = 10⁻³, ω0 = 0.5, ω1 = 10⁻³; rSGD fast: η = 10⁻⁴, y = 10, γ0 = 2·10⁻³, γ1 = 2·10⁻³, β0 = 1.0, β1 = 2·10⁻⁴, ω0 = 0.5, ω1 = 10⁻³; rSGD slow: η = 10⁻³, y = 10, γ0 = 10⁻⁴, γ1 = 10⁻⁴, β0 = 1.0, β1 = 2·10⁻⁴, ω0 = 0.5, ω1 = 10⁻³; eSGD: η = 10⁻³, η′ = 5·10⁻³, ϵ = 10⁻⁶, L = 20, γ0 = 10.0, γ1 = 5·10⁻⁵, β0 = 1.0, β1 = 10⁻⁴, ω0 = 0.5, ω1 = 5·10⁻⁴. In all experiments, the loss L is the usual cross-entropy and the parameter initialization is Kaiming normal. We normalize images in the train and test sets by the mean and variance over the train set. We also apply random crops (of width w if the image size is w × w, with zero-padding of size 4 for CIFAR and 8 for Tiny ImageNet) and random horizontal flips. In the following we refer to the latter procedure as 'standard preprocessing'. All experiments are implemented using PyTorch (Paszke et al., 2019). (A hedged sketch of this preprocessing and initialization is given after the table.) |
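
The Research Type excerpt breaks off just before the committee-machine output equation. The block below is a hedged reconstruction based on the standard sign-activation committee machine studied in Baldassi et al. (2020), with the second-layer weights fixed to 1 as described in the excerpt; it is an editorial assumption, not a verbatim equation from the paper.

```latex
% Hedged reconstruction (assumption): sign-activation committee machine with
% first-layer weights W = {w_k}_{k=1}^{K} and second-layer weights fixed to 1.
\hat{y}(\mathbf{x}; W) \;=\; \operatorname{sign}\!\left( \sum_{k=1}^{K} \operatorname{sign}\big( \mathbf{w}_k \cdot \mathbf{x} \big) \right)
```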
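The Pseudocode row lists the eSGD hyper-parameters (L, η, γ, η′, ϵ, α) but not the algorithm body. The following is a minimal PyTorch sketch of one Entropy-SGD step in the spirit of Chaudhari et al.'s formulation: an inner SGLD loop on a replica estimates the local-entropy gradient, and the outer step moves the weights toward the replica average. The function name `esgd_step`, the `loss_fn`/`data_iter` interface, and the default values are illustrative assumptions, not the authors' released code.

```python
import math
import torch

def esgd_step(params, loss_fn, data_iter, L=20, eta=1e-3, eta_prime=5e-3,
              gamma=0.1, eps=1e-6, alpha=0.75):
    """One Entropy-SGD outer step (hedged sketch, not the authors' code).

    `params` are leaf tensors with requires_grad=True; `loss_fn(params, x, y)`
    returns a scalar loss and `data_iter` yields mini-batches (illustrative
    interface). An inner SGLD loop on a replica w' estimates mu ~ <w'> under
    the local Gibbs measure; the outer update then moves the center w toward
    mu, i.e. along an estimate of the local-entropy gradient.
    """
    w = [p.detach().clone() for p in params]    # frozen center w
    mu = [p.detach().clone() for p in params]   # running average of the replica w'

    for _ in range(L):
        x, y = next(data_iter)
        loss = loss_fn(params, x, y)
        grads = torch.autograd.grad(loss, params)
        with torch.no_grad():
            for p, g, w0, m in zip(params, grads, w, mu):
                # SGLD on w': loss gradient minus coupling to the center,
                # plus Gaussian noise of scale eps * sqrt(eta')
                noise = eps * math.sqrt(eta_prime) * torch.randn_like(p)
                p -= eta_prime * (g - gamma * (w0 - p)) + noise
                # exponential moving average of the replica
                m.mul_(1 - alpha).add_(alpha * p)

    with torch.no_grad():
        for p, w0, m in zip(params, w, mu):
            # outer update of the center along gamma * (w - mu)
            p.copy_(w0 - eta * gamma * (w0 - m))
```

The γ0/γ1 values listed in the Experiment Setup row suggest that in the actual experiments γ is scheduled over training rather than held fixed; the constant `gamma` argument above is only a simplification of that schedule.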
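The Experiment Setup row describes "standard preprocessing" (normalization by train-set statistics, random crops with zero-padding, random horizontal flips) and Kaiming-normal initialization. The snippet below is a hedged torchvision/PyTorch sketch of that recipe for CIFAR-sized inputs; the normalization constants and the `kaiming_normal_init` helper are illustrative placeholders, not the authors' released configuration.

```python
import torch.nn as nn
from torchvision import transforms

# Placeholder per-channel statistics: in the paper's setup these are the mean
# and variance computed over the training set of the dataset at hand.
TRAIN_MEAN, TRAIN_STD = (0.4914, 0.4822, 0.4465), (0.2470, 0.2435, 0.2616)

# "Standard preprocessing" for 32x32 CIFAR images (padding 8 for Tiny ImageNet).
train_transform = transforms.Compose([
    transforms.RandomCrop(32, padding=4),         # random crop of width w with zero-padding 4
    transforms.RandomHorizontalFlip(),            # random horizontal flips
    transforms.ToTensor(),
    transforms.Normalize(TRAIN_MEAN, TRAIN_STD),  # normalize by train-set statistics
])

test_transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(TRAIN_MEAN, TRAIN_STD),
])

def kaiming_normal_init(module):
    """Kaiming-normal initialization, applied via model.apply(kaiming_normal_init)."""
    if isinstance(module, (nn.Conv2d, nn.Linear)):
        nn.init.kaiming_normal_(module.weight)
        if module.bias is not None:
            nn.init.zeros_(module.bias)
```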