Entropic gradient descent algorithms and wide flat minima
Authors: Fabrizio Pittorino, Carlo Lucibello, Christoph Feinauer, Gabriele Perugini, Carlo Baldassi, Elizaveta Demyanenko, Riccardo Zecchina
ICLR 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We study two algorithms, Entropy-SGD and Replicated-SGD, that explicitly include the local entropy in the optimization objective. We devise a training schedule by which we consistently find flatter minima (using both flatness measures), and improve the generalization error for common architectures (e.g. ResNet, EfficientNet). In this section, we explore in detail the connection between the two flatness measures and the generalization properties in a one-hidden-layer network that performs a binary classification task, also called a committee machine. This model has a symmetry that allows all the weights in the last layer to be fixed to 1, so that only the first layer is trained. It is also invariant to rescaling of the weights. This allows its typical properties to be studied analytically with statistical mechanics techniques, and it was shown in Baldassi et al. (2020) that it has a rich non-convex error-loss landscape, in which rare flat minima coexist with narrower ones. It is amenable to semi-analytical study: for individual instances, the minimizers found by different algorithms can be compared by computing their local entropy efficiently with the Belief Propagation (BP) algorithm (see Appendix B.1), bypassing the need to perform the integral in Eq. (1) explicitly. Doing the same for general architectures is an open problem. For a network with K hidden units, the output predicted for a given input pattern x reads: (the formula is missing from the extracted excerpt; a hedged reconstruction is sketched after the table). We follow the numerical setting of Baldassi et al. (2020) and train this network to perform binary classification on two classes of the Fashion-MNIST dataset with binarized patterns, comparing the results of standard SGD with cross-entropy loss (CE) with the entropic counterparts rSGD and eSGD. In this section we show that, by optimizing the local entropy with eSGD and rSGD, we are able to systematically improve the generalization performance compared to standard SGD. We perform experiments on image classification tasks, using common benchmark datasets, state-of-the-art deep architectures and the usual cross-entropy loss. The detailed settings of the experiments are reported in the SM. For the experiments with eSGD and rSGD, we use the same settings and hyper-parameters (architecture, dropout, learning rate schedule, ...) as for the baseline, unless otherwise stated in the SM and apart from the hyper-parameters specific to these algorithms. |
| Researcher Affiliation | Academia | Fabrizio Pittorino1,2, Carlo Lucibello1, Christoph Feinauer1, Gabriele Perugini1, Carlo Baldassi1, Elizaveta Demyanenko1, Riccardo Zecchina1 1AI Lab, Institute for Data Science and Analytics, Bocconi University, 20136 Milano, Italy 2Dept. Applied Science and Technology, Politecnico di Torino, 10129 Torino, Italy |
| Pseudocode | Yes | Algorithm 1: Entropy-SGD (eSGD). Input: w. Hyper-parameters: L, η, γ, η′, ϵ, α. Algorithm 2: Replicated-SGD (rSGD). Input: {wᵃ}. Hyper-parameters: y, η, γ, K. (A hedged PyTorch sketch of the eSGD inner/outer update is given after the table.) |
| Open Source Code | Yes | While we hope to foster the application of entropic algorithms by publishing code that can be used to adapt them easily to new architectures, we also believe that the numeric results are important for theoretical research, since they are rooted in a well-defined geometric interpretation of the loss landscape. |
| Open Datasets | Yes | Fashion-MNIST dataset, CIFAR-10, CIFAR-100, Tiny ImageNet |
| Dataset Splits | No | The paper frequently mentions 'training set' and 'test set' and reports 'test set error' and 'train error difference'. However, it does not explicitly provide details for a 'validation set' or specific dataset splits that include a validation portion. |
| Hardware Specification | No | No specific hardware details (e.g., GPU/CPU models, memory amounts, or cloud instance types) used for running the experiments are provided in the paper. The paper only discusses software implementations and datasets. |
| Software Dependencies | Yes | All experiments are implemented using PyTorch (Paszke et al., 2019). Default parameter initialization (PyTorch 1.3). |
| Experiment Setup | Yes | The detailed settings of the experiments are reported in the SM. For the experiments with eSGD and rSGD, we use the same settings and hyper-parameters (architecture, dropout, learning rate schedule, ...) as for the baseline, unless otherwise stated in the SM and apart from the hyper-parameters specific to these algorithms. We used the following hyper-parameters for the various algorithms: SGD fast: η = 2·10⁻⁴, β0 = 2.0, β1 = 10⁻⁴, ω0 = 5.0, ω1 = 0.0; SGD slow: η = 3·10⁻⁵, β0 = 0.5, β1 = 10⁻³, ω0 = 0.5, ω1 = 10⁻³; rSGD fast: η = 10⁻⁴, y = 10, γ0 = 2·10⁻³, γ1 = 2·10⁻³, β0 = 1.0, β1 = 2·10⁻⁴, ω0 = 0.5, ω1 = 10⁻³; rSGD slow: η = 10⁻³, y = 10, γ0 = 10⁻⁴, γ1 = 10⁻⁴, β0 = 1.0, β1 = 2·10⁻⁴, ω0 = 0.5, ω1 = 10⁻³; eSGD: η = 10⁻³, η′ = 5·10⁻³, ϵ = 10⁻⁶, L = 20, γ0 = 10.0, γ1 = 5·10⁻⁵, β0 = 1.0, β1 = 10⁻⁴, ω0 = 0.5, ω1 = 5·10⁻⁴. In all experiments, the loss L is the usual cross-entropy and the parameter initialization is Kaiming normal. We normalize images in the train and test sets by the mean and variance over the train set. We also apply random crops (of width w if the image size is w × w, with zero-padding of size 4 for CIFAR and 8 for Tiny ImageNet) and random horizontal flips. In the following we refer to the latter procedure as 'standard preprocessing'. All experiments are implemented using PyTorch (Paszke et al., 2019). (A hedged sketch of this preprocessing and initialization is given after the table.) |
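
The Research Type excerpt breaks off just before the committee-machine output equation. The block below is a hedged reconstruction based on the standard sign-activation committee machine studied in Baldassi et al. (2020), with the second-layer weights fixed to 1 as described in the excerpt; it is an editorial assumption, not a verbatim equation from the paper.

```latex
% Hedged reconstruction (assumption): sign-activation committee machine with
% first-layer weights W = {w_k}_{k=1}^{K} and second-layer weights fixed to 1.
\hat{y}(\mathbf{x}; W) \;=\; \operatorname{sign}\!\left( \sum_{k=1}^{K} \operatorname{sign}\big( \mathbf{w}_k \cdot \mathbf{x} \big) \right)
```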
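The Pseudocode row lists the eSGD hyper-parameters (L, η, γ, η′, ϵ, α) but not the algorithm body. The following is a minimal PyTorch sketch of one Entropy-SGD step in the spirit of Chaudhari et al.'s formulation: an inner SGLD loop on a replica estimates the local-entropy gradient, and the outer step moves the weights toward the replica average. The function name `esgd_step`, the `loss_fn`/`data_iter` interface, and the default values are illustrative assumptions, not the authors' released code.

```python
import math
import torch

def esgd_step(params, loss_fn, data_iter, L=20, eta=1e-3, eta_prime=5e-3,
              gamma=0.1, eps=1e-6, alpha=0.75):
    """One Entropy-SGD outer step (hedged sketch, not the authors' code).

    `params` are leaf tensors with requires_grad=True; `loss_fn(params, x, y)`
    returns a scalar loss and `data_iter` yields mini-batches (illustrative
    interface). An inner SGLD loop on a replica w' estimates mu ~ <w'> under
    the local Gibbs measure; the outer update then moves the center w toward
    mu, i.e. along an estimate of the local-entropy gradient.
    """
    w = [p.detach().clone() for p in params]    # frozen center w
    mu = [p.detach().clone() for p in params]   # running average of the replica w'

    for _ in range(L):
        x, y = next(data_iter)
        loss = loss_fn(params, x, y)
        grads = torch.autograd.grad(loss, params)
        with torch.no_grad():
            for p, g, w0, m in zip(params, grads, w, mu):
                # SGLD on w': loss gradient minus coupling to the center,
                # plus Gaussian noise of scale eps * sqrt(eta')
                noise = eps * math.sqrt(eta_prime) * torch.randn_like(p)
                p -= eta_prime * (g - gamma * (w0 - p)) + noise
                # exponential moving average of the replica
                m.mul_(1 - alpha).add_(alpha * p)

    with torch.no_grad():
        for p, w0, m in zip(params, w, mu):
            # outer update of the center along gamma * (w - mu)
            p.copy_(w0 - eta * gamma * (w0 - m))
```

The γ0/γ1 values listed in the Experiment Setup row suggest that in the actual experiments γ is scheduled over training rather than held fixed; the constant `gamma` argument above is only a simplification of that schedule.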
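The Experiment Setup row describes "standard preprocessing" (normalization by train-set statistics, random crops with zero-padding, random horizontal flips) and Kaiming-normal initialization. The snippet below is a hedged torchvision/PyTorch sketch of that recipe for CIFAR-sized inputs; the normalization constants and the `kaiming_normal_init` helper are illustrative placeholders, not the authors' released configuration.

```python
import torch.nn as nn
from torchvision import transforms

# Placeholder per-channel statistics: in the paper's setup these are the mean
# and variance computed over the training set of the dataset at hand.
TRAIN_MEAN, TRAIN_STD = (0.4914, 0.4822, 0.4465), (0.2470, 0.2435, 0.2616)

# "Standard preprocessing" for 32x32 CIFAR images (padding 8 for Tiny ImageNet).
train_transform = transforms.Compose([
    transforms.RandomCrop(32, padding=4),         # random crop of width w with zero-padding 4
    transforms.RandomHorizontalFlip(),            # random horizontal flips
    transforms.ToTensor(),
    transforms.Normalize(TRAIN_MEAN, TRAIN_STD),  # normalize by train-set statistics
])

test_transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(TRAIN_MEAN, TRAIN_STD),
])

def kaiming_normal_init(module):
    """Kaiming-normal initialization, applied via model.apply(kaiming_normal_init)."""
    if isinstance(module, (nn.Conv2d, nn.Linear)):
        nn.init.kaiming_normal_(module.weight)
        if module.bias is not None:
            nn.init.zeros_(module.bias)
```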