Improved Analysis of Clipping Algorithms for Non-convex Optimization

Authors: Bohang Zhang, Jikai Jin, Cong Fang, Liwei Wang

NeurIPS 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments confirm the superiority of clipping-based methods in deep learning tasks. We conduct extensive experiments and find that the clipping algorithms indeed consistently outperform their unclipped counterparts. We present experimental results on three deep learning benchmarks: CIFAR-10 classification using ResNet-32, ImageNet classification using ResNet-50, and language modeling on the Penn Treebank (PTB) dataset using AWD-LSTM.
Researcher Affiliation | Academia | Bohang Zhang (Key Laboratory of Machine Perception, MOE, School of EECS, Peking University; zhangbohang@pku.edu.cn); Jikai Jin (School of Mathematical Sciences, Peking University; jkjin@pku.edu.cn); Cong Fang (University of Pennsylvania; fangcong@pku.edu.cn); Liwei Wang (Key Laboratory of Machine Perception, MOE, School of EECS, Peking University; Center of Data Science, Peking University; wanglw@cis.pku.edu.cn)
Pseudocode | Yes | Algorithm 1: The General Clipping Framework
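The paper's Algorithm 1 is a general clipping framework; the paper itself should be consulted for the exact update rule. As a hedged illustration only, the sketch below shows one common instantiation, clipped SGD with momentum, assuming the effective step size is capped at min(η, γ/‖m‖) so each update has norm at most γ. The function name and the 1e-12 stabilizer are illustrative choices, not from the paper.

```python
import numpy as np

def clipped_sgd_momentum_step(x, m, grad, eta=1.0, beta=0.9, gamma=0.25):
    """One illustrative step of clipped SGD with momentum.

    The momentum buffer is an exponential moving average of gradients,
    and the step size is capped so the update norm never exceeds gamma.
    """
    m = beta * m + (1 - beta) * grad                    # momentum buffer update
    h = min(eta, gamma / (np.linalg.norm(m) + 1e-12))   # clipped step size
    return x - h * m, m
```

With a very large gradient the update norm stays bounded by γ, which is the boundedness property clipping analyses rely on; with small gradients the rule reduces to plain SGD with momentum at step size η.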
Open Source Code | Yes | Our code is available at https://github.com/zbh2047/clipping-algorithms.
Open Datasets | Yes | CIFAR-10 classification using ResNet-32, ImageNet classification using ResNet-50, and language modeling on the Penn Treebank (PTB) dataset using AWD-LSTM.
Dataset Splits | Yes | CIFAR-10 classification using ResNet-32, ImageNet classification using ResNet-50, and language modeling on the Penn Treebank (PTB) dataset using AWD-LSTM.
Hardware Specification | No | The paper states 'We use batch size 256 on 4 GPUs.' but does not specify the GPU model or any other hardware components.
Software Dependencies | No | The paper does not provide specific version numbers for any software dependencies.
Experiment Setup | Yes | We set learning rate η = 1.0, momentum β = 0.9, and minibatch size 128, following common practice. For all the clipping algorithms, we choose the best η and γ based on a coarse grid search, while keeping the other hyper-parameters and the training strategy the same as for SGD+momentum. We simply set the hyperparameters ν = 0.7 and β = 0.999 in mixed clipping, as suggested in Ma and Yarats [2018] (for its unclipped counterpart QHM).
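The mixed-clipping hyperparameters quoted above come from QHM (Ma and Yarats [2018]), whose update direction interpolates between the raw gradient and a momentum buffer with weight ν. The sketch below shows how those values might plug into a clipped QHM step; the clipping rule (capping the update norm at γ), the function name, and the γ value are assumptions for illustration, not the paper's exact mixed-clipping algorithm.

```python
import numpy as np

def mixed_clipping_step(x, m, grad, eta=1.0, gamma=0.25, nu=0.7, beta=0.999):
    """Illustrative clipped QHM step using the hyperparameters quoted above."""
    m = beta * m + (1 - beta) * grad      # slow momentum buffer (beta = 0.999)
    d = (1 - nu) * grad + nu * m          # QHM direction: nu-weighted mix
    h = min(eta, gamma / (np.linalg.norm(d) + 1e-12))  # cap update norm at gamma
    return x - h * d, m
```

The grid search described in the paper would then vary η and γ while holding ν and β fixed at the QHM defaults.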