Improved Analysis of Clipping Algorithms for Non-convex Optimization
Authors: Bohang Zhang, Jikai Jin, Cong Fang, Liwei Wang
NeurIPS 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments confirm the superiority of clipping-based methods in deep learning tasks. We conduct extensive experiments and find the clipping algorithms indeed consistently outperform their unclipped counterparts. We present experimental results on three deep learning benchmarks: CIFAR-10 classification using ResNet-32, ImageNet classification using ResNet-50, and language modeling on the Penn Treebank (PTB) dataset using AWD-LSTM. |
| Researcher Affiliation | Academia | Bohang Zhang, Key Laboratory of Machine Perception, MOE, School of EECS, Peking University (zhangbohang@pku.edu.cn); Jikai Jin, School of Mathematical Sciences, Peking University (jkjin@pku.edu.cn); Cong Fang, University of Pennsylvania (fangcong@pku.edu.cn); Liwei Wang, Key Laboratory of Machine Perception, MOE, School of EECS, Peking University and Center of Data Science, Peking University (wanglw@cis.pku.edu.cn) |
| Pseudocode | Yes | Algorithm 1: The General Clipping Framework |
| Open Source Code | Yes | Our code is available at https://github.com/zbh2047/clipping-algorithms. |
| Open Datasets | Yes | CIFAR-10 classification using ResNet-32, ImageNet classification using ResNet-50, and language modeling on the Penn Treebank (PTB) dataset using AWD-LSTM. |
| Dataset Splits | Yes | CIFAR-10 classification using ResNet-32, ImageNet classification using ResNet-50, and language modeling on the Penn Treebank (PTB) dataset using AWD-LSTM. |
| Hardware Specification | No | The paper states 'We use batch size 256 on 4 GPUs.' but does not specify the model of the GPUs or any other hardware components. |
| Software Dependencies | No | The paper does not provide specific version numbers for any software dependencies. |
| Experiment Setup | Yes | We set learning rate η = 1.0, momentum β = 0.9 and minibatch size 128, following the common practice. For all the clipping algorithms, we choose the best η and γ based on a coarse grid search, while keeping other hyperparameters and training strategy the same as SGD+momentum. We simply set the hyperparameters ν = 0.7 and β = 0.999 in mixed clipping, as suggested in Ma and Yarats [2018] (for its unclipped counterpart QHM). |
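To make the setup above concrete, the following is a minimal sketch (not the paper's implementation) of global-norm gradient clipping combined with heavy-ball momentum, in the spirit of the paper's "General Clipping Framework". The hyperparameters η = 1.0 and β = 0.9 follow the experiment setup quoted above; the clipping threshold `gamma = 0.25`, the function names, and the toy quadratic objective in the usage example are illustrative assumptions.

```python
import numpy as np

def clip_grad(g, gamma):
    """Global norm clipping: rescale g so its Euclidean norm is at most gamma."""
    norm = np.linalg.norm(g)
    return g * (gamma / norm) if norm > gamma else g

def clipped_sgd_momentum(grad_fn, x0, eta=1.0, beta=0.9, gamma=0.25, steps=200):
    """Clipped SGD with heavy-ball momentum (illustrative sketch).

    eta and beta match the hyperparameters quoted from the paper;
    gamma here is an arbitrary illustrative choice, not a tuned value.
    """
    x = np.asarray(x0, dtype=float)
    v = np.zeros_like(x)  # momentum buffer
    for _ in range(steps):
        g = clip_grad(grad_fn(x), gamma)  # clip before the momentum update
        v = beta * v + g                  # accumulate momentum
        x = x - eta * v                   # parameter step
    return x

# Usage: minimize the toy objective f(x) = 0.5 * ||x||^2, whose gradient is x.
x_star = clipped_sgd_momentum(lambda x: x, x0=[5.0, -3.0])
```

In the actual deep learning experiments, the best η and γ were chosen by a coarse grid search per the quoted setup; this sketch only shows where clipping enters relative to the momentum update.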