AdamP: Slowing Down the Slowdown for Momentum Optimizers on Scale-invariant Weights
Authors: Byeongho Heo, Sanghyuk Chun, Seong Joon Oh, Dongyoon Han, Sangdoo Yun, Gyuwan Kim, Youngjung Uh, Jung-Woo Ha
ICLR 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Given the ubiquity of momentum GD and scale invariance in machine learning, we have evaluated our methods against the baselines on 13 benchmarks. They range from vision tasks like classification (e.g. ImageNet), retrieval (e.g. CUB and SOP), and detection (e.g. COCO) to language modelling (e.g. WikiText) and audio classification (e.g. DCASE) tasks. We verify that our solution brings about uniform gains in performances in those benchmarks. |
| Researcher Affiliation | Collaboration | Naver AI Lab, Naver Clova, Applied Information Engineering, Yonsei University |
| Pseudocode | Yes | The proposed method is readily adaptable to existing gradient-based optimization algorithms like SGD and Adam. Their modifications, SGDP and AdamP, are shown in Algorithms 1 and 2, respectively (modifications are colorized). A simplified sketch of the projection step appears below the table. |
| Open Source Code | Yes | Source code is available at https://github.com/clovaai/adamp. (A minimal usage sketch appears below the table.) |
| Open Datasets | Yes | ImageNet-1K benchmark (Russakovsky et al., 2015)... MS-COCO dataset (Lin et al., 2014)... CIFAR-10... MagnaTagATune (MTAT) dataset (Law et al., 2009)... Speech Commands dataset (Warden, 2018)... DCASE 2017 challenge (Mesaros et al., 2017)... CUB (Wah et al., 2011), Cars-196 (Krause et al., 2013), In-Shop (Liu et al., 2016b), and SOP (Oh Song et al., 2016) benchmarks. |
| Dataset Splits | Yes | We have searched the best hyperparameters for the Adam optimizer on the MTAT validation dataset and have transferred them to AdamP experiments. |
| Hardware Specification | Yes | The training sessions are run for 100 epochs (ResNet18, ResNet50) or 150 epochs (MobileNetV2, ResNet50 + CutMix) with the cosine learning rate schedule (Loshchilov & Hutter, 2016) on a machine with four NVIDIA V100 GPUs. |
| Software Dependencies | No | The paper states 'All experiments are conducted based on PyTorch.' but does not provide specific version numbers for PyTorch or any other software dependencies. |
| Experiment Setup | Yes | Experiments on ResNet (He et al., 2016) are conducted based on the standard settings: learning rate 0.1, weight decay 10⁻⁴, batch-size 256, momentum 0.9 with Nesterov (Sutskever et al., 2013) for SGD and SGDP. For the Adam series, we use learning rate 0.001, weight decay 10⁻⁴, batch-size 256, β1 = 0.9, β2 = 0.999, ϵ = 10⁻⁸. (These values are collected in the configuration sketch below the table.) |
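
As a reading aid for the pseudocode entry, here is a minimal sketch of the projection step that, per the paper, SGDP and AdamP insert before the weight update. The helper names (`project_if_scale_invariant`, `_cosine_similarity`) and the layer-wise (rather than channel-wise) treatment are simplifications introduced here; the authors' full Algorithms 1 and 2 and their reference implementation live in the repository linked above.

```python
# A minimal sketch of the projection step applied by AdamP/SGDP to the update
# direction, reconstructed from the paper's description. Function names and the
# default delta are illustrative; see https://github.com/clovaai/adamp for the
# reference implementation.
import math
import torch


def _cosine_similarity(x: torch.Tensor, y: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    # |<x, y>| / (||x|| * ||y||), computed over the flattened tensors.
    x, y = x.flatten(), y.flatten()
    return (x * y).sum().abs() / (x.norm() * y.norm() + eps)


def project_if_scale_invariant(weight: torch.Tensor,
                               grad: torch.Tensor,
                               update: torch.Tensor,
                               delta: float = 0.1) -> torch.Tensor:
    """Remove the radial component of `update` when `weight` looks scale-invariant.

    Heuristic from the paper: if the cosine similarity between the weight and its
    gradient falls below delta / sqrt(dim), the weight is treated as scale-invariant
    (e.g. it feeds a BatchNorm layer), and the update is projected onto the tangent
    space of the norm sphere so the weight norm does not keep growing.
    """
    dim = weight.numel()
    if _cosine_similarity(weight, grad) < delta / math.sqrt(dim):
        w_unit = weight / (weight.norm() + 1e-8)
        # Pi_w(x) = x - <w_unit, x> * w_unit  (projection orthogonal to the weight)
        update = update - (w_unit * update).sum() * w_unit
    return update
```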
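Since the released code is a pip-installable package, a minimal usage sketch follows. It assumes `pip install adamp` provides `AdamP` (and `SGDP`) as drop-in replacements for the standard `torch.optim` optimizers, as the repository README describes; the toy model and random data are placeholders.

```python
# Hedged usage sketch for the released `adamp` package.
import torch
from adamp import AdamP

model = torch.nn.Linear(16, 4)
optimizer = AdamP(model.parameters(), lr=1e-3, weight_decay=1e-4)

x, y = torch.randn(8, 16), torch.randint(0, 4, (8,))
loss = torch.nn.functional.cross_entropy(model(x), y)

# Standard PyTorch optimizer protocol; the projection step runs inside step().
optimizer.zero_grad()
loss.backward()
optimizer.step()
```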
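Finally, the ImageNet hyperparameters quoted in the experiment-setup row can be expressed as the following configuration sketch. The optimizer arguments mirror the reported values; the ResNet18 backbone, the 100-epoch cosine schedule, and the surrounding training loop are illustrative stand-ins rather than the authors' exact scripts.

```python
# Configuration sketch for the reported ImageNet settings (batch size 256 would be
# set on the DataLoader, which is omitted here).
import torchvision
from torch.optim.lr_scheduler import CosineAnnealingLR
from adamp import AdamP, SGDP

model = torchvision.models.resnet18()  # illustrative backbone

# SGD / SGDP settings: lr 0.1, weight decay 1e-4, Nesterov momentum 0.9.
sgdp = SGDP(model.parameters(), lr=0.1, weight_decay=1e-4,
            momentum=0.9, nesterov=True)

# Adam / AdamP settings: lr 1e-3, weight decay 1e-4, betas (0.9, 0.999), eps 1e-8.
adamp = AdamP(model.parameters(), lr=1e-3, weight_decay=1e-4,
              betas=(0.9, 0.999), eps=1e-8)

# Cosine learning-rate schedule over the 100-epoch budget reported for ResNet18/50.
scheduler = CosineAnnealingLR(sgdp, T_max=100)
```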