Over-parameterized Model Optimization with Polyak-Łojasiewicz Condition

Authors: Yixuan Chen, Yubin Shi, Mingzhi Dong, Xiaochen Yang, Dongsheng Li, Yujiang Wang, Robert P. Dick, Qin Lv, Yingying Zhao, Fan Yang, Ning Gu, Li Shang

ICLR 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental studies demonstrate that the proposed method outperforms the baselines in terms of both training efficiency and test performance, exhibiting the potential of generalizing to a variety of deep network architectures and tasks.
Researcher Affiliation | Collaboration | 1. Shanghai Key Laboratory of Data Science, School of Computer Science, Fudan University, Shanghai, China. 2. School of Mathematics & Statistics, The University of Glasgow, Glasgow, UK. 3. Microsoft Research Asia, Shanghai, China. 4. Department of Engineering Science, University of Oxford, Oxford, England. 5. Department of Electrical Engineering and Computer Science, University of Michigan, Michigan, United States. 6. Department of Computer Science, University of Colorado Boulder, Boulder, Colorado, United States. 7. School of Microelectronics, Fudan University, Shanghai, China.
Pseudocode | Yes | Algorithm 1: Hessian Trace Computation. (A hedged sketch of a generic Hessian-trace estimator follows the table.)
Open Source Code | No | The paper does not contain an explicit statement about the release of its source code or a link to a code repository for the methodology described.
Open Datasets | Yes | We train BERT with the self-supervised masked language modeling (MLM) task (Devlin et al., 2018) on WikiText-2 (Merity et al., 2016). Our method (Ours (experts only)) significantly outperforms all the baselines in both training efficiency and generalization ability, demonstrating that our method is applicable to the MoE architecture. In particular, compared with the vanilla Switch-Transformer, when used on the MoE architecture, our method improves training perplexity by 62%, 41%, and 24% at 50k, 60k, and 70k iterations, while improving test performance by 13%. This phenomenon indicates that while the sparsely-activated MoE architecture selects different experts for each data point, there also exist certain poorly-behaved experts. By modeling the optimization dynamics that guarantee convergence and generalization ability, the proposed PL regularization-driven pruning method can disable under- or over-specialized experts and keep well-behaved experts. We also investigate the benefits of implementing PL regularization on both the heads of MHA and the experts of MoE, as shown in the last line of Table 2. Compared with the vanilla Switch-Transformer, implementing PL regularization on both heads and experts yields a 28% improvement in training perplexity after 70k iterations and a 13% improvement in test performance. This result further demonstrates that the proposed PL regularization is flexible and applicable to different architectures. This experiment focuses on VGG-16 (Simonyan & Zisserman, 2014) trained on the CIFAR-10 and CIFAR-100 datasets. All models are trained from scratch and are pruned with a linear schedule where, starting from epoch 15, we prune the same number of filters at each epoch until the target sparsity is reached (an illustrative sketch of such a schedule follows the table). More details of baselines and experimental settings can be found in Appendix E.4. Additional experimental results for ResNet-56 can be found in Appendix F.1.
Dataset Splits | No | The paper mentions using training and test sets from standard datasets such as WikiText-2, WikiText-103, CIFAR-10, and CIFAR-100, but it does not explicitly provide the specific percentages or sample counts for training, validation, and test splits, nor does it specify any splitting methodology.
Hardware Specification | Yes | These experiments are conducted on 8 NVIDIA GeForce RTX 3090 GPUs.
Software Dependencies | No | The paper mentions software components implicitly through usage (e.g., the 'Adam' optimizer) and references (e.g., 'PyTorch' in citations), but it does not specify explicit version numbers for any key software dependencies or libraries.
Experiment Setup | Yes | The training hyperparameters are presented in Table 4 (BERT / Switch Transformer): Number of layers 12 / 12; Hidden size 768 / 768; Attention heads 12 / 12; Dropout 0.1 / 0.1; Sequence length 512 / 512; Batch size 8 / 8; Warmup steps 0% / 6%; Weight decay 0 / 1e-2; Peak learning rate 1e-4 / 2e-4; Learning rate decay Linear / Cosine; Adam [ϵ, β1, β2] [0, 0, 0] / [1e-6, 0.9, 0.999]; Number of experts – / 4; Capacity factor – / 1.5. (These values are restated as a configuration sketch after the table.)
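
The Pseudocode row above only records that the paper includes Algorithm 1 (Hessian Trace Computation); the algorithm itself is not reproduced in this report. For orientation, the sketch below shows one standard way to estimate a Hessian trace, Hutchinson's stochastic estimator in PyTorch. It is an assumption-laden illustration (function name, sample count, and probe distribution are choices made here), not the authors' Algorithm 1.

```python
import torch


def hutchinson_trace(loss, params, n_samples=32):
    """Estimate tr(H) of `loss` w.r.t. `params` using Hutchinson's method:
    E[v^T H v] = tr(H) for Rademacher probe vectors v."""
    grads = torch.autograd.grad(loss, params, create_graph=True)
    estimates = []
    for _ in range(n_samples):
        # Rademacher probes: entries are +1 or -1 with equal probability.
        vs = [(torch.rand_like(p) < 0.5).to(p.dtype) * 2 - 1 for p in params]
        # Hessian-vector product via a second backward pass through the gradients.
        hvps = torch.autograd.grad(grads, params, grad_outputs=vs, retain_graph=True)
        estimates.append(sum((h * v).sum() for h, v in zip(hvps, vs)))
    return torch.stack(estimates).mean()
```

Calling `hutchinson_trace(loss, list(model.parameters()))` after a forward pass returns a scalar trace estimate; averaging more probes trades compute for lower variance.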
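
The quoted pruning setup (a linear schedule that starts at epoch 15 and removes the same number of filters each epoch until a target sparsity is reached) can be illustrated with the short helper below. The function name, the `end_epoch` argument, and the rounding choice are hypothetical; the paper's exact schedule may differ.

```python
def filters_to_prune(epoch, total_filters, target_sparsity,
                     start_epoch=15, end_epoch=60):
    """Number of filters to remove at `epoch` under a linear pruning schedule
    that reaches `target_sparsity` (fraction of filters removed) by `end_epoch`."""
    if epoch < start_epoch or epoch > end_epoch:
        return 0
    total_to_remove = int(total_filters * target_sparsity)
    steps = end_epoch - start_epoch + 1
    per_epoch = total_to_remove // steps
    # Fold any rounding remainder into the last step so the target is met exactly.
    return per_epoch + (total_to_remove % steps if epoch == end_epoch else 0)
```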
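
For readability, the flattened Table 4 quoted in the Experiment Setup row can be restated as a plain configuration mapping. This is only a transcription of the quoted values, not code released by the authors.

```python
# Transcription of the quoted Table 4 hyperparameters; not code from the authors.
TRAINING_HYPERPARAMS = {
    "BERT": {
        "num_layers": 12, "hidden_size": 768, "attention_heads": 12,
        "dropout": 0.1, "sequence_length": 512, "batch_size": 8,
        "warmup_steps": "0%", "weight_decay": 0.0,
        "peak_learning_rate": 1e-4, "lr_decay": "linear",
        "adam_eps_beta1_beta2": (0.0, 0.0, 0.0),
    },
    "Switch Transformer": {
        "num_layers": 12, "hidden_size": 768, "attention_heads": 12,
        "dropout": 0.1, "sequence_length": 512, "batch_size": 8,
        "warmup_steps": "6%", "weight_decay": 1e-2,
        "peak_learning_rate": 2e-4, "lr_decay": "cosine",
        "adam_eps_beta1_beta2": (1e-6, 0.9, 0.999),
        "num_experts": 4, "capacity_factor": 1.5,
    },
}
```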