Remove that Square Root: A New Efficient Scale-Invariant Version of AdaGrad

Authors: Sayantan Choudhury, Nazarii Tupitsa, Nicolas Loizou, Samuel Horváth, Martin Takáč, Eduard Gorbunov

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We also compare KATE to other state-of-the-art adaptive algorithms, Adam and AdaGrad, in numerical experiments with different problems, including complex machine learning tasks like image classification and text classification on real data. (Section 4, Numerical Experiments) In this section, we implement KATE in several machine learning tasks to evaluate its performance.
Researcher Affiliation | Academia | Sayantan Choudhury (MBZUAI & Johns Hopkins University), Nazarii Tupitsa (MBZUAI & Innopolis University), Nicolas Loizou (Johns Hopkins University), Samuel Horváth (MBZUAI), Martin Takáč (MBZUAI), Eduard Gorbunov (MBZUAI)
Pseudocode | Yes | Algorithm 1: KATE (an illustrative optimizer sketch is given below the table).
Open Source Code | Yes | To ensure transparency and facilitate reproducibility, we provide access to the source code for all of our experiments at https://github.com/nazya/KATE.
Open Datasets | Yes | We test KATE on three datasets: heart, australian, and splice from the LIBSVM library (Chang and Lin, 2011). The larger-scale experiments train ResNet18 (He et al., 2016) on the CIFAR10 dataset (Krizhevsky and Hinton, 2009) and fine-tune BERT (Devlin et al., 2018) on the emotions dataset (Saravia et al., 2018) from the Hugging Face Hub (see the data-loading sketch below the table).
Dataset Splits | No | The paper uses standard datasets such as CIFAR10 and the LIBSVM datasets, which typically come with predefined splits. However, it does not explicitly state the training, validation, and test splits used in its experiments (e.g., specific percentages, sample counts, or an explicit statement, with citation, that the standard splits were used).
Hardware Specification | Yes | We use an internal cluster with the following hardware: AMD EPYC 7552 48-core processor, 512 GiB RAM, NVIDIA A100 40GB GPU, and 200 GB of user storage space.
Software Dependencies | No | The paper mentions PyTorch as the framework whose default Adam parameters are used, but it does not specify exact version numbers for PyTorch or any other software dependencies required to replicate the experiments.
Experiment Setup | Yes | We choose standard parameters for Adam (β1 = 0.9 and β2 = 0.999), which are the default values in PyTorch, and select a learning rate of 10⁻⁵ for all considered methods. We run KATE with different values of η ∈ {0, 10⁻¹, 10⁻²}. For the image classification task, we normalize the images (similar to Horváth and Richtárik (2020)) and use a mini-batch size of 500. For the BERT fine-tuning, we use a mini-batch size of 160 for all methods (see the setup sketch below the table).
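
The report quotes only the name of Algorithm 1 (KATE) and not its update rule. As a rough illustration of what a square-root-free, scale-invariant AdaGrad-style step looks like, here is a minimal sketch, assuming per-coordinate accumulators b² (running sum of squared gradients) and m, with the update m ← m + η·g² + g²/b² and the step w ← w − γ·√m/b²·g; the function name `kate_step`, the state layout, and the `eps` safeguard are illustrative choices and are not taken from the paper itself.

```python
import torch

@torch.no_grad()
def kate_step(param, grad, state, lr=1e-5, eta=0.0, eps=1e-8):
    """One per-coordinate KATE-style update (illustrative sketch, not the paper's Algorithm 1 verbatim)."""
    if "b2" not in state:
        state["b2"] = torch.zeros_like(param)  # running sum of squared gradients (no square root applied)
        state["m"] = torch.zeros_like(param)   # numerator accumulator
    g2 = grad * grad
    state["b2"] += g2
    # Assumed accumulator update: m += eta * g^2 + g^2 / b^2.
    state["m"] += eta * g2 + g2 / (state["b2"] + eps)
    # Assumed step: w <- w - lr * sqrt(m) / b^2 * g, element-wise.
    param -= lr * state["m"].sqrt() / (state["b2"] + eps) * grad
    return param
```

In practice such a step would be wrapped in a `torch.optim.Optimizer` subclass; the sketch only shows the per-parameter arithmetic.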
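
All datasets named in the Open Datasets row are publicly available. A minimal sketch of how they could be fetched, assuming torchvision, the Hugging Face `datasets` library, and scikit-learn are installed; the Hub identifier "emotion" and the local LIBSVM file paths are assumptions, since the report does not specify them:

```python
from torchvision import datasets, transforms       # CIFAR10 for the ResNet18 experiment
from datasets import load_dataset                   # Hugging Face Hub dataset for BERT fine-tuning
from sklearn.datasets import load_svmlight_file     # LIBSVM-format files (heart, australian, splice)

# CIFAR10 (Krizhevsky and Hinton, 2009), downloaded on first use.
cifar_train = datasets.CIFAR10(root="./data", train=True, download=True,
                               transform=transforms.ToTensor())

# Emotions dataset (Saravia et al., 2018); "emotion" is the usual Hub identifier (assumed).
emotions = load_dataset("emotion")

# LIBSVM binary-classification datasets must be downloaded separately from the
# LIBSVM website; the path below is a placeholder.
X_heart, y_heart = load_svmlight_file("data/heart")
```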
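
The Adam settings and batch size quoted in the Experiment Setup row translate directly into standard PyTorch calls. A minimal sketch for the CIFAR10 image-classification experiment, assuming the commonly used CIFAR10 normalization statistics (the report only says the images are normalized as in Horváth and Richtárik (2020)):

```python
import torch
from torch.utils.data import DataLoader
from torchvision import datasets, transforms
from torchvision.models import resnet18

# Normalization constants are the commonly used CIFAR10 statistics (an assumption).
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2470, 0.2435, 0.2616)),
])
train_set = datasets.CIFAR10(root="./data", train=True, download=True, transform=transform)
train_loader = DataLoader(train_set, batch_size=500, shuffle=True)  # mini-batch size 500, as reported

model = resnet18(num_classes=10)
# Adam with PyTorch defaults beta1 = 0.9, beta2 = 0.999 and learning rate 1e-5, as reported.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5, betas=(0.9, 0.999))
```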