AdamP: Slowing Down the Slowdown for Momentum Optimizers on Scale-invariant Weights

Authors: Byeongho Heo, Sanghyuk Chun, Seong Joon Oh, Dongyoon Han, Sangdoo Yun, Gyuwan Kim, Youngjung Uh, Jung-Woo Ha

ICLR 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Given the ubiquity of momentum GD and scale invariance in machine learning, we have evaluated our methods against the baselines on 13 benchmarks. They range from vision tasks like classification (e.g. ImageNet), retrieval (e.g. CUB and SOP), and detection (e.g. COCO) to language modelling (e.g. WikiText) and audio classification (e.g. DCASE) tasks. We verify that our solution brings about uniform gains in performances in those benchmarks.
Researcher Affiliation | Collaboration | Naver AI Lab, Naver Clova, and Applied Information Engineering, Yonsei University
Pseudocode | Yes | The proposed method is readily adaptable to existing gradient-based optimization algorithms like SGD and Adam. Their modifications, SGDP and AdamP, are shown in Algorithms 1 and 2, respectively (modifications are colorized). A minimal sketch of the shared projection step appears after this table.
Open Source Code | Yes | Source code is available at https://github.com/clovaai/adamp.
Open Datasets | Yes | ImageNet-1K benchmark (Russakovsky et al., 2015)... MS-COCO dataset (Lin et al., 2014)... CIFAR-10... MagnaTagATune (MTAT) dataset (Law et al., 2009)... Speech Commands dataset (Warden, 2018)... DCASE 2017 challenge (Mesaros et al., 2017)... CUB (Wah et al., 2011), Cars-196 (Krause et al., 2013), In-Shop (Liu et al., 2016b), and SOP (Oh Song et al., 2016) benchmarks.
Dataset Splits | Yes | We have searched the best hyperparameters for the Adam optimizer on the MTAT validation dataset and have transferred them to AdamP experiments.
Hardware Specification | Yes | The training sessions are run for 100 epochs (ResNet18, ResNet50) or 150 epochs (MobileNetV2, ResNet50 + CutMix) with the cosine learning rate schedule (Loshchilov & Hutter, 2016) on a machine with four NVIDIA V100 GPUs.
Software Dependencies | No | The paper states 'All experiments are conducted based on PyTorch,' but does not provide specific version numbers for PyTorch or any other software dependencies.
Experiment Setup | Yes | Experiments on ResNet (He et al., 2016) are conducted based on the standard settings: learning rate 0.1, weight decay 10⁻⁴, batch size 256, momentum 0.9 with Nesterov (Sutskever et al., 2013) for SGD and SGDP. For the Adam series, we use learning rate 0.001, weight decay 10⁻⁴, batch size 256, β1 = 0.9, β2 = 0.999, ε = 10⁻⁸. A usage sketch with these settings appears after this table.
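
As noted in the Pseudocode row, Algorithms 1 and 2 (SGDP and AdamP) differ from plain SGD and Adam only by a projection that removes the radial component of the update on weights that appear scale-invariant. The following is a minimal PyTorch sketch of that projection step, not the official implementation: the function and argument names are hypothetical, the delta=0.1 default is an assumption, and the released code additionally checks a layer-wise view and applies a reduced weight-decay ratio, which this sketch omits.

```python
import math
import torch

def _channel_view(x):
    # Flatten each output channel so cosine similarity and the projection
    # can be computed per channel.
    return x.reshape(x.size(0), -1)

def project_if_scale_invariant(weight, grad, update, delta=0.1, eps=1e-8):
    """Return `update` with its radial component removed when `weight` looks
    scale-invariant, i.e. when cos(weight, grad) is close to zero."""
    w = _channel_view(weight)
    g = _channel_view(grad)
    cosine = (w * g).sum(dim=1).abs() / (w.norm(dim=1) * g.norm(dim=1) + eps)
    # Threshold used in the paper: delta / sqrt(channel dimension)
    if cosine.max() < delta / math.sqrt(w.size(1)):
        u = _channel_view(update)
        w_hat = w / (w.norm(dim=1, keepdim=True) + eps)
        u = u - (u * w_hat).sum(dim=1, keepdim=True) * w_hat  # drop radial part
        return u.view_as(update)
    return update
```

In SGDP this projection is applied to the momentum-based step and in AdamP to the bias-corrected Adam step, before the learning rate is applied.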
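
The Open Source Code and Experiment Setup rows together pin down the optimizer configuration used for the ResNet experiments. Below is a hedged usage sketch against the released package (pip install adamp, from https://github.com/clovaai/adamp); constructor keyword names may differ across package versions, so treat them as assumptions rather than a verified API, and note that the batch size (256) and the cosine learning-rate schedule live outside the optimizer itself.

```python
import torchvision
from adamp import AdamP, SGDP  # pip install adamp

model = torchvision.models.resnet18()

# SGD-style settings from the Experiment Setup row:
# lr 0.1, momentum 0.9 with Nesterov, weight decay 1e-4.
sgdp = SGDP(model.parameters(), lr=0.1, momentum=0.9, nesterov=True,
            weight_decay=1e-4)

# Adam-style settings from the Experiment Setup row:
# lr 0.001, betas (0.9, 0.999), eps 1e-8, weight decay 1e-4.
adamp = AdamP(model.parameters(), lr=1e-3, betas=(0.9, 0.999), eps=1e-8,
              weight_decay=1e-4)

# Either optimizer is then driven like any torch.optim optimizer:
#   optimizer.zero_grad(); loss.backward(); optimizer.step()
```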