KOALA: A Kalman Optimization Algorithm with Loss Adaptivity

Authors: Aram Davtyan, Sepehr Sameni, Llukman Cerkezi, Givi Meishvili, Adam Bielski, Paolo Favaro

AAAI 2022, pp. 6471-6479 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We provide convergence analysis and show experimentally that it yields parameter estimates that are on par with or better than existing state of the art optimization algorithms across several neural network architectures and machine learning tasks, such as computer vision and language modeling. In this section we ablate the following features and parameters of both KOALA-V and KOALA-M algorithms: the dynamics of the weights and velocities, the initialization of the posterior covariance matrix and the adaptivity of the state noise estimators. We evaluate KOALA-M on different tasks, including image classification (on CIFAR-10, CIFAR-100 and ImageNet (Russakovsky et al. 2015)), generative learning and language modeling.
Researcher Affiliation | Academia | Aram Davtyan, Sepehr Sameni, Llukman Cerkezi, Givi Meishvili, Adam Bielski, Paolo Favaro; Computer Vision Group, University of Bern, Switzerland; {aram.davtyan, sepehr.sameni, llukman.cerkezi, givi.meishvili, adam.bielski, paolo.favaro}@inf.unibe.ch
Pseudocode | Yes | Algorithm 1: KOALA-V (Vanilla) (an illustrative sketch of this style of update appears after the table)
Open Source Code | Yes | The project page with the code and the supplementary materials is available at https://araachie.github.io/koala/.
Open Datasets | Yes | We evaluate KOALA-M on different tasks, including image classification (on CIFAR-10, CIFAR-100 and ImageNet (Russakovsky et al. 2015)). In all the ablations, we choose the classification task on CIFAR-100 (Krizhevsky and Hinton 2009). (See the data-loading sketch after the table.)
Dataset Splits | No | The paper reports "Top-1 and Top-5 errors on the validation set" and mentions training for a specific number of epochs, but does not explicitly detail how the validation set is split from the overall dataset (percentages or counts).
Hardware Specification | No | The paper does not provide specific details on the hardware used for experiments, such as GPU models, CPU specifications, or cloud computing instances and their configurations.
Software Dependencies | No | The paper does not provide specific software dependencies with version numbers (e.g., Python 3.x, PyTorch 1.x) within the main text.
Experiment Setup | Yes | We train all the models for 100 epochs and decrease the learning rate by a factor of 0.2 every 30 epochs. For SGD we set the momentum rate to 0.9, which is the default for many popular networks, and for Adam we use the default parameters β1 = 0.9, β2 = 0.999, ϵ = 10^-8. In all experiments on CIFAR-10/100, we use a batch size of 128 and basic data augmentation (random horizontal flipping and random cropping with padding by 4 pixels). For all the algorithms, we additionally use a weight decay of 0.0005.
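
The Pseudocode row points to Algorithm 1 (KOALA-V) in the paper. As a rough illustration of a Kalman-style weight update with a scalar posterior covariance, here is a minimal NumPy sketch; it reflects our hedged reading of that idea, not the authors' Algorithm 1, and the names koala_v_step, q, r, and loss_target, along with their default values, are assumptions made for illustration only.

import numpy as np

def koala_v_step(w, grad, loss, sigma2, q=0.1, r=1.0, loss_target=0.0):
    """One Kalman-style weight update with a scalar posterior covariance.

    w           : flat parameter vector (np.ndarray)
    grad        : gradient of the mini-batch loss at w
    loss        : mini-batch loss value at w
    sigma2      : scalar posterior variance (P = sigma2 * I)
    q, r        : assumed process/observation noise variances (hypothetical defaults)
    loss_target : desired loss used as the "measurement" (assumed to be 0)
    """
    # Predict: identity dynamics, covariance grows by the process noise.
    sigma2_pred = sigma2 + q

    # Innovation: difference between the target loss and the observed loss.
    innovation = loss_target - loss

    # Scalar innovation covariance H P H^T + R, with H = grad^T and P = sigma2 * I.
    s = sigma2_pred * np.dot(grad, grad) + r

    # Kalman gain restricted to the gradient direction, then the state update.
    gain = (sigma2_pred / s) * grad
    w_new = w + gain * innovation

    # Posterior variance shrinks by the information gained from the measurement.
    sigma2_new = sigma2_pred * (r / s)
    return w_new, sigma2_new

Because the innovation loss_target - loss is negative whenever the current loss exceeds the target, the update moves w along -grad, i.e. it behaves like gradient descent with a step size sigma2_pred * (loss - loss_target) / s that adapts to the loss value and the gradient magnitude.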
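
CIFAR-10 and CIFAR-100 from the Open Datasets row are available through torchvision, so the ablation data pipeline can be reproduced in a few lines. The sketch below loads CIFAR-100 with the batch size and augmentation quoted under Experiment Setup; the root path, num_workers, and normalization statistics are illustrative assumptions, not values from the paper.

import torch
from torchvision import datasets, transforms

# Augmentation described in the experiment setup: random horizontal flipping
# and random cropping with 4 pixels of padding. The normalization statistics
# are common CIFAR-100 values, not taken from the paper.
train_transform = transforms.Compose([
    transforms.RandomCrop(32, padding=4),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize((0.5071, 0.4865, 0.4409), (0.2673, 0.2564, 0.2762)),
])

train_set = datasets.CIFAR100(root="./data", train=True, download=True,
                              transform=train_transform)
train_loader = torch.utils.data.DataLoader(train_set, batch_size=128,
                                           shuffle=True, num_workers=4)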
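
The baseline hyperparameters in the Experiment Setup row map directly onto standard PyTorch optimizers and schedulers. The sketch below is one way to reproduce that configuration; the ResNet-18 model and the initial learning rates are placeholders, since this excerpt does not state them.

import torch
from torchvision.models import resnet18

# Placeholder model; the paper evaluates several architectures.
model = resnet18(num_classes=100)

# SGD baseline: momentum 0.9 and weight decay 5e-4, as stated in the paper.
# The initial learning rate is not given in this excerpt, so 0.1 is assumed.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=5e-4)

# Adam baseline with its default parameters (beta1=0.9, beta2=0.999, eps=1e-8).
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-3,
#                              betas=(0.9, 0.999), eps=1e-8, weight_decay=5e-4)

# Decrease the learning rate by a factor of 0.2 every 30 epochs; train 100 epochs.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.2)

for epoch in range(100):
    # ... one pass over train_loader with the usual forward/backward/step ...
    scheduler.step()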