Exploring the Inefficiency of Heavy Ball as Momentum Parameter Approaches 1

Authors: Xiaoge Deng, Tao Sun, Dongsheng Li, Xicheng Lu

IJCAI 2024

Reproducibility assessment: each entry below lists the variable, the assessed result, and the LLM response with supporting evidence from the paper.
Research Type: Experimental. Evidence: "We conduct experiments employing several machine learning models, including the ℓ2-regularized multi-class logistic regression and a multi-layer perceptron (MLP) for MNIST classification [LeCun and Cortes, 2010]. Figure 1 illustrates the behavior of the minibatch approximation of E[R_S(w_k) - R_S(w_*)] (the expectation is obtained via averaging over five independent runs), with w_k obtained from SGD or SHB with different β. Table 1 presents the test accuracy of the two models trained by the four different optimizers, and the training process is illustrated in Figure 7 (ResNet34)."
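The five-run averaging behind that excess-risk curve is straightforward to outline. Below is a minimal sketch, assuming a hypothetical train_run(seed) helper that returns the per-epoch empirical risk R_S(w_k) of one training run, and a precomputed reference value risk_opt for R_S(w_*); neither name comes from the paper.

```python
import numpy as np

def excess_risk_curve(train_run, risk_opt, n_runs=5):
    """Estimate E[R_S(w_k) - R_S(w_*)] by averaging independent runs.

    train_run(seed) -> 1-D array of empirical risks R_S(w_k), one entry per epoch
    risk_opt        -> scalar R_S(w_*), e.g. obtained by running a solver to
                       convergence beforehand (hypothetical helper contract)
    """
    curves = np.stack([train_run(seed=r) for r in range(n_runs)])
    return curves.mean(axis=0) - risk_opt  # averaged excess risk per epoch
```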
Researcher Affiliation: Academia. Evidence: "Xiaoge Deng, Tao Sun, Dongsheng Li and Xicheng Lu, College of Computer Science and Technology, National University of Defense Technology, China. dengxg@nudt.edu.cn, suntao.saltfish@outlook.com, {dsli, xclu}@nudt.edu.cn"
Pseudocode: Yes. Evidence: Algorithm 1, Stochastic Heavy Ball (SHB); Algorithm 2, Stochastic Heavy Ball with Descending Warmup (SHB-DW).
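For context, the SHB recursion named in Algorithm 1 is the standard stochastic heavy-ball update w_{k+1} = w_k - η g_k + β (w_k - w_{k-1}). The sketch below is a generic implementation of that update, not the authors' code; grad is an assumed stochastic (minibatch) gradient oracle, and no claim is made here about the SHB-DW warmup schedule.

```python
import numpy as np

def shb(w0, grad, lr=0.01, beta=0.9, n_steps=1000):
    """Stochastic Heavy Ball: w_{k+1} = w_k - lr*g_k + beta*(w_k - w_{k-1})."""
    w_prev, w = w0.copy(), w0.copy()
    for _ in range(n_steps):
        g = grad(w)  # stochastic gradient g_k at the current iterate
        # simultaneous assignment: the old w becomes the new w_prev
        w, w_prev = w - lr * g + beta * (w - w_prev), w
    return w

# toy usage: minimize a noisy quadratic 0.5 * ||w||^2
rng = np.random.default_rng(0)
w_out = shb(np.ones(5), lambda w: w + 0.01 * rng.standard_normal(5))
```

The paper's regime of interest, per its title, is β approaching 1; β is the parameter swept in the experiments quoted above.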
Open Source Code: No. The paper does not contain any statement or link indicating the release of open-source code for the described methodology.
Open Datasets: Yes. Evidence: "MNIST classification [LeCun and Cortes, 2010]"; "ResNet18 [He et al., 2016a] model to classify the CIFAR10 [Krizhevsky et al., 2009] dataset."
Dataset Splits: No. For MNIST the paper states that "The dataset contains 60,000 and 10,000 grayscale images in the training and test sets," and for CIFAR10 it reports "50,000 and 10,000 colored images in the training and test sets, respectively." It specifies batch sizes but does not describe a separate validation split or the percentages of a training/validation/test split.
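The quoted sizes match the standard MNIST and CIFAR10 distributions, so the train/test splits can be reproduced directly. The torchvision loaders below are an assumption; the paper does not name its software stack.

```python
from torchvision import datasets, transforms

t = transforms.ToTensor()
mnist_train = datasets.MNIST("data", train=True,  download=True, transform=t)    # 60,000 images
mnist_test  = datasets.MNIST("data", train=False, download=True, transform=t)    # 10,000 images
cifar_train = datasets.CIFAR10("data", train=True,  download=True, transform=t)  # 50,000 images
cifar_test  = datasets.CIFAR10("data", train=False, download=True, transform=t)  # 10,000 images
```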
Hardware Specification: No. The paper does not provide any specific details about the hardware (e.g., GPU/CPU models, memory, or cloud instances) used to run the experiments.
Software Dependencies: No. The paper does not provide specific version numbers for any software components, libraries, or programming languages used in the experiments.
Experiment Setup: Yes. Evidence: "The training batch size is set to 256 for all the following tasks. We apply both SGD and SHB with a learning rate of 0.01 and use different momentum parameters β for SHB to train the multi-class logistic regression classifier with ℓ2-regularization (set as 0.01) on the weights. The algorithms are run for 200 epochs with a batch size of 256. The initial learning rate for Adam is set to 0.001, while the others are set to 0.1."
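A sketch of those hyperparameters in PyTorch, which is itself an assumption (the paper does not state its framework). The model is a placeholder, and note that torch.optim.SGD's momentum buffer (v_{k+1} = β v_k + g_k, then w_{k+1} = w_k - η v_{k+1}) matches classical heavy ball only up to a rescaling of the learning rate.

```python
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(28 * 28, 10)  # placeholder for the l2-regularized logistic regression

# Logistic-regression task: lr = 0.01, l2 coefficient 0.01 (via weight_decay),
# batch size 256, 200 epochs; SHB additionally sweeps the momentum parameter beta.
sgd = optim.SGD(model.parameters(), lr=0.01, weight_decay=0.01)
shb = optim.SGD(model.parameters(), lr=0.01, momentum=0.99, weight_decay=0.01)

# Four-optimizer comparison: Adam starts at 1e-3, the other optimizers at 0.1.
adam = optim.Adam(model.parameters(), lr=1e-3)
```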