KOALA: A Kalman Optimization Algorithm with Loss Adaptivity
Authors: Aram Davtyan, Sepehr Sameni, Llukman Cerkezi, Givi Meishvili, Adam Bielski, Paolo Favaro
AAAI 2022, pp. 6471-6479 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We provide convergence analysis and show experimentally that it yields parameter estimates that are on par with or better than existing state of the art optimization algorithms across several neural network architectures and machine learning tasks, such as computer vision and language modeling. In this section we ablate the following features and parameters of both KOALA-V and KOALA-M algorithms: the dynamics of the weights and velocities, the initialization of the posterior covariance matrix and the adaptivity of the state noise estimators. We evaluate KOALA-M on different tasks, including image classification (on CIFAR-10, CIFAR-100 and ImageNet (Russakovsky et al. 2015)), generative learning and language modeling. |
| Researcher Affiliation | Academia | Aram Davtyan, Sepehr Sameni, Llukman Cerkezi, Givi Meishvili, Adam Bielski, Paolo Favaro Computer Vision Group, University of Bern, Switzerland {aram.davtyan, sepehr.sameni, llukman.cerkezi, givi.meishvili, adam.bielski, paolo.favaro}@inf.unibe.ch |
| Pseudocode | Yes (a hedged sketch follows the table) | Algorithm 1: KOALA-V (Vanilla) |
| Open Source Code | Yes | The project page with the code and the supplementary materials is available at https://araachie.github.io/koala/. |
| Open Datasets | Yes | We evaluate KOALA-M on different tasks, including image classification (on CIFAR-10, CIFAR-100 and ImageNet (Russakovsky et al. 2015)). In all the ablations, we choose the classification task on CIFAR-100 (Krizhevsky and Hinton 2009). |
| Dataset Splits | No | The paper reports "Top-1 and Top-5 errors on the validation set" and mentions training for a specific number of epochs, but does not explicitly detail the split percentages or sample counts used to construct the validation set from the overall dataset. |
| Hardware Specification | No | The paper does not provide specific details on the hardware used for experiments, such as GPU models, CPU specifications, or cloud computing instances with their configurations. |
| Software Dependencies | No | The paper does not provide specific software dependencies with version numbers (e.g., Python 3.x, PyTorch 1.x, etc.) within the main text. |
| Experiment Setup | Yes | We train all the models for 100 epochs and decrease the learning rate by a factor of 0.2 every 30 epochs. For SGD we set the momentum rate to 0.9, which is the default for many popular networks, and for Adam we use the default parameters β1 = 0.9, β2 = 0.999, ϵ = 10⁻⁸. In all experiments on CIFAR-10/100, we use a batch size of 128 and basic data augmentation (random horizontal flipping and random cropping with padding by 4 pixels). For all the algorithms, we additionally use a weight decay of 0.0005. |
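
The "Pseudocode" row refers to Algorithm 1 (KOALA-V) in the paper. The snippet below is only a minimal sketch of the general idea behind such a Kalman-style optimizer, assuming a scalar posterior covariance, random-walk weight dynamics, and the training loss treated as a noisy observation of a target loss; it is not the authors' exact algorithm, and the loss-adaptive noise estimation that gives KOALA its name is omitted. The class name `KalmanStyleSGD` and the knobs `sigma2`, `q`, `r`, and `target_loss` are placeholders introduced here, not the paper's notation.

```python
import torch


class KalmanStyleSGD(torch.optim.Optimizer):
    """Illustrative scalar-covariance, EKF-style update (not the exact KOALA-V).

    The weights are the filter state with random-walk dynamics; the training
    loss is a noisy scalar observation whose target is `target_loss`.
    """

    def __init__(self, params, sigma2=1.0, q=0.1, r=1.0, target_loss=0.0):
        defaults = dict(q=q, r=r, target_loss=target_loss)
        super().__init__(params, defaults)
        self.sigma2 = sigma2  # shared scalar posterior covariance

    @torch.no_grad()
    def step(self, loss_value):
        group = self.param_groups[0]
        # Predict: random-walk dynamics keep the weights, inflate the covariance.
        sigma2_pred = self.sigma2 + group["q"]

        # Innovation: gap between the observed loss and the target loss.
        innovation = float(loss_value) - group["target_loss"]

        # For a scalar covariance, H P H^T reduces to sigma2 * ||grad||^2.
        grad_sq = float(sum(
            p.grad.pow(2).sum()
            for g in self.param_groups for p in g["params"]
            if p.grad is not None
        ))
        s = sigma2_pred * grad_sq + group["r"]  # innovation covariance

        # Update: Kalman-gain-weighted gradient step; the step size adapts to
        # both the current loss value and the gradient norm.
        scale = sigma2_pred * innovation / s
        for g in self.param_groups:
            for p in g["params"]:
                if p.grad is not None:
                    p.add_(p.grad, alpha=-scale)

        # Scalar covariance update: P <- (1 - K H) P.
        self.sigma2 = sigma2_pred * (1.0 - sigma2_pred * grad_sq / s)
        return loss_value
```

In a training loop one would call `loss.backward()` and then `opt.step(loss.item())`; the effective step size then shrinks automatically as the loss approaches the target and as the gradient norm grows.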
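
The quoted experiment setup maps onto a standard PyTorch training configuration. The sketch below is one plausible reading of it, assuming CIFAR-100, a torchvision ResNet-18, and a base learning rate of 0.1 (none of which are fixed by the quoted text); the batch size, augmentation, momentum, weight decay, epoch budget, and learning-rate schedule follow the stated values.

```python
import torch
import torchvision
import torchvision.transforms as T

# Basic augmentation described in the paper: random horizontal flipping
# and random cropping with 4-pixel padding.
train_transform = T.Compose([
    T.RandomCrop(32, padding=4),
    T.RandomHorizontalFlip(),
    T.ToTensor(),
])

train_set = torchvision.datasets.CIFAR100(
    root="./data", train=True, download=True, transform=train_transform
)
train_loader = torch.utils.data.DataLoader(
    train_set, batch_size=128, shuffle=True, num_workers=4
)

# Model and base learning rate are assumptions for this sketch; the quoted
# setup does not specify them.
model = torchvision.models.resnet18(num_classes=100)

# SGD baseline: momentum 0.9 and weight decay 0.0005, as reported.
optimizer = torch.optim.SGD(
    model.parameters(), lr=0.1, momentum=0.9, weight_decay=5e-4
)
# Alternatively, the Adam baseline with its default parameters:
# torch.optim.Adam(model.parameters(), betas=(0.9, 0.999), eps=1e-8,
#                  weight_decay=5e-4)

# Learning rate decayed by a factor of 0.2 every 30 epochs, 100 epochs total.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.2)

criterion = torch.nn.CrossEntropyLoss()
for epoch in range(100):
    for images, labels in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
    scheduler.step()
```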