Over-parameterized Model Optimization with Polyak-Łojasiewicz Condition
Authors: Yixuan Chen, Yubin Shi, Mingzhi Dong, Xiaochen Yang, Dongsheng Li, Yujiang Wang, Robert P. Dick, Qin Lv, Yingying Zhao, Fan Yang, Ning Gu, Li Shang
ICLR 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental studies demonstrate that the proposed method outperforms the baselines in terms of both training efficiency and test performance, exhibiting the potential of generalizing to a variety of deep network architectures and tasks. |
| Researcher Affiliation | Collaboration | 1 Shanghai Key Laboratory of Data Science, School of Computer Science, Fudan University, Shanghai, China. 2 School of Mathematics & Statistics, The University of Glasgow, Glasgow, UK. 3 Microsoft Research Asia, Shanghai, China. 4 Department of Engineering Science, University of Oxford, Oxford, England. 5 Department of Electrical Engineering and Computer Science, University of Michigan, Michigan, United States. 6 Department of Computer Science, University of Colorado Boulder, Boulder, Colorado, United States. 7 School of Microelectronics, Fudan University, Shanghai, China. |
| Pseudocode | Yes | Algorithm 1 Hessian Trace Computation (a generic trace-estimation sketch is given after the table). |
| Open Source Code | No | The paper does not contain an explicit statement about the release of its source code or a link to a code repository for the methodology described. |
| Open Datasets | Yes | We train BERT with the self-supervised masked language modeling (MLM) task (Devlin et al., 2018) on WikiText-2 (Merity et al., 2016). Our method (Ours (experts only)) significantly outperforms all the baselines in both training efficiency and generalization ability, demonstrating that our method is applicable to the MoE architecture. In particular, compared with the vanilla Switch-Transformer, when used on the MoE architecture, our method improves training perplexity by 62%, 41%, 24% in 50k, 60k, and 70k iterations, while improving test performance by 13%. This phenomenon indicates that while the sparsely-activated MoE architecture selects different experts for each data point, there also exist certain poorly-behaved experts. By modeling the optimization dynamics which guarantee convergence and generalization ability, the proposed PL regularization-driven pruning method can disable under- or over-specialized experts and keep well-behaved experts. We also investigate the benefits of implementing PL regularization on both the heads of MHA and the experts of MoE, as shown in the last line of Table 2. Compared with the vanilla Switch-Transformer, implementing PL regularization on both heads and experts presents a 28% improvement in training perplexity after 70k iterations and a 13% improvement in test performance. This result further demonstrates that the proposed PL regularization is flexible and applicable to different architectures. This experiment focuses on VGG-16 (Simonyan & Zisserman, 2014) trained on the CIFAR-10 and CIFAR-100 datasets. All models are trained from scratch and are pruned with a linear schedule where, starting from epoch 15, we prune the same number of filters at each epoch until the target sparsity is reached (a hedged sketch of such a schedule is given after the table). More details of baselines and experimental settings can be found in Appendix E.4. Additional experimental results of ResNet-56 can be found in Appendix F.1. |
| Dataset Splits | No | The paper mentions using training and test sets from standard datasets like WikiText-2, WikiText-103, CIFAR-10, and CIFAR-100, but it does not explicitly provide the specific percentages or sample counts for training, validation, and test splits, nor does it specify any splitting methodology. |
| Hardware Specification | Yes | These experiments are conducted on 8 NVIDIA GeForce RTX 3090 GPUs. |
| Software Dependencies | No | The paper mentions software components implicitly through usage (e.g., 'Adam' optimizer) and references (e.g., 'PyTorch' in citations), but it does not specify explicit version numbers for any key software dependencies or libraries. |
| Experiment Setup | Yes | The training hyperparameters are presented in Table 4 (BERT / Switch Transformer): Number of layers 12 / 12; Hidden size 768 / 768; Attention heads 12 / 12; Dropout 0.1 / 0.1; Sequence length 512 / 512; Batch size 8 / 8; Warmup steps 0% / 6%; Weight decay 0 / 1e-2; Peak learning rate 1e-4 / 2e-4; Learning rate decay Linear / Cosine; Adam [ϵ, β1, β2] [0, 0, 0] / [1e-6, 0.9, 0.999]; Number of experts – / 4; Capacity factor – / 1.5 (restated as configuration dictionaries after the table). |
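
The paper's Algorithm 1 (Hessian Trace Computation) is only named in the Pseudocode row above. Below is a minimal sketch of the standard Hutchinson-style trace estimator that such an algorithm typically builds on, not the authors' implementation; the function name and the `num_samples` parameter are illustrative.

```python
import torch


def hutchinson_hessian_trace(loss, params, num_samples=100):
    """Estimate tr(H) of `loss` w.r.t. `params` via Hutchinson's method (generic sketch)."""
    # Keep the graph so the gradients can be differentiated a second time.
    grads = torch.autograd.grad(loss, params, create_graph=True)
    estimates = []
    for _ in range(num_samples):
        # Rademacher probe vectors: entries are +1 or -1 with equal probability.
        vs = [torch.randint_like(p, 2) * 2.0 - 1.0 for p in params]
        # Hessian-vector product H v from a second backward pass on g . v.
        hvs = torch.autograd.grad(grads, params, grad_outputs=vs, retain_graph=True)
        # v^T H v, summed over all parameter tensors, is an unbiased estimate of tr(H).
        estimates.append(sum((v * hv).sum() for v, hv in zip(vs, hvs)))
    return torch.stack(estimates).mean()
```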
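
The filter-pruning schedule quoted in the Open Datasets row (linear, starting from epoch 15, the same number of filters removed each epoch until the target sparsity is reached) could be generated roughly as follows. This is one plausible reading, not the authors' code: `end_epoch` and the function name are assumptions, and the paper's exact settings are in its Appendix E.4.

```python
def linear_pruning_schedule(total_filters, target_sparsity, start_epoch=15, end_epoch=160):
    """Return {epoch: filters_to_prune} for a linear schedule (assumed end_epoch)."""
    filters_to_remove = int(total_filters * target_sparsity)
    pruning_epochs = end_epoch - start_epoch
    per_epoch = filters_to_remove // pruning_epochs
    schedule = {epoch: per_epoch for epoch in range(start_epoch, end_epoch)}
    # Assign any rounding remainder to the final pruning epoch.
    schedule[end_epoch - 1] += filters_to_remove - per_epoch * pruning_epochs
    return schedule
```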
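
For convenience, the Table 4 values quoted in the Experiment Setup row can be transcribed as plain configuration dictionaries. This is only a transcription of the quoted numbers, not the authors' training script, and the key names are illustrative.

```python
bert_config = {
    "num_layers": 12, "hidden_size": 768, "attention_heads": 12,
    "dropout": 0.1, "sequence_length": 512, "batch_size": 8,
    "warmup_steps": 0.0, "weight_decay": 0.0, "peak_learning_rate": 1e-4,
    "lr_decay": "linear", "adam_eps_beta1_beta2": (0.0, 0.0, 0.0),
}

switch_transformer_config = {
    "num_layers": 12, "hidden_size": 768, "attention_heads": 12,
    "dropout": 0.1, "sequence_length": 512, "batch_size": 8,
    "warmup_steps": 0.06, "weight_decay": 1e-2, "peak_learning_rate": 2e-4,
    "lr_decay": "cosine", "adam_eps_beta1_beta2": (1e-6, 0.9, 0.999),
    "num_experts": 4, "capacity_factor": 1.5,
}
```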