Improving Deep Learning Optimization through Constrained Parameter Regularization

Authors: Jörg Franke, Michael Hefenbrock, Gregor Koehler, Frank Hutter

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our empirical studies on computer vision and language modeling tasks demonstrate CPR's effectiveness. The results show that CPR can outperform traditional weight decay and increase performance in pre-training and fine-tuning.
Researcher Affiliation | Collaboration | Jörg K.H. Franke (University of Freiburg, Germany); Michael Hefenbrock (RevoAI, Karlsruhe, Germany); Gregor Koehler (German Cancer Research Center (DKFZ), Heidelberg, Germany); Frank Hutter (ELLIS Institute Tübingen, Germany; University of Freiburg, Germany)
Pseudocode | Yes | Algorithm 1: Optimization with constrained parameter regularization (CPR). Algorithm 2: Optimization with constrained parameter regularization (CPR) and Kappa-WS. Algorithm 3: Optimization with adaptive bound constrained parameter regularization (AdaCPR). (A hedged sketch of the CPR update appears after this table.)
Open Source Code | Yes | Please find our implementation under https://github.com/automl/CPR.
Open Datasets | Yes | To evaluate CPR's effectiveness and design choices, we tested AdamW and Adam with CPR (AdamCPR) in image classification using a ResNet18 on the CIFAR100 dataset [25, 26].
Dataset Splits | Yes | We found that setting the CPR warm start steps s to twice the warm-up steps is a good initial choice. For very low warm-up steps, the best s was four times the warm-up count. Conversely, with a long warm-up phase, a shorter CPR warm start (1x the warm-up steps) is preferable. Notably, the optimal choice of s is almost independent of the learning rate, as shown in Figure E.3. The optimal warm start steps are consistent across a wide range of learning rates.
Hardware Specification | Yes | For example, training the small model on 4 A100 GPUs took 14.85h for AdamW and 14.89h for AdamCPR. The GPT2s and GPT2m models were trained on 8 A100 GPUs for up to 28h.
Software Dependencies | No | The paper mentions using PyTorch and specific libraries like 'PyTorch Image Models library [31]' and 'flash attention [45]', but does not specify exact version numbers for these software dependencies.
Experiment Setup | Yes | We use a learning rate warm-up of 500 steps and the best Kappa-WS value is 2x the warm-up steps. We report the mean of three runs with random seeds. We provide details of the training and hyperparameters in Appendix H. (A usage sketch with these values follows the table.)
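
The pseudocode row above references the CPR algorithms. Below is a minimal, hypothetical sketch of a CPR-style update wrapped around a standard PyTorch optimizer, based on the paper's description rather than its released code: it assumes the constraint c(W) is the mean of squared entries of each regularized tensor, that the bound kappa is set from the constraint value reached after a warm-start phase (Kappa-WS), and that the Lagrange multipliers are updated with a simple projected ascent step. The authors' reference implementation is at https://github.com/automl/CPR.

```python
# Hypothetical sketch of constrained parameter regularization (CPR), not the
# authors' reference implementation (see https://github.com/automl/CPR).
import torch


def constraint_value(param: torch.Tensor) -> torch.Tensor:
    """c(W): mean squared entry of the parameter tensor (one possible statistic)."""
    return param.square().mean()


class CPRRegularizer:
    """Replaces decoupled weight decay with a per-tensor constraint c(W) <= kappa."""

    def __init__(self, params, mu: float = 1.0, warm_start_steps: int = 1000):
        self.params = list(params)
        # One multiplier lambda and one bound kappa per regularized tensor.
        self.lambdas = [torch.zeros((), device=p.device) for p in self.params]
        self.kappas = [None] * len(self.params)
        self.mu = mu                           # step size for the multiplier update
        self.warm_start_steps = warm_start_steps
        self.step_count = 0

    @torch.no_grad()
    def apply(self) -> None:
        """Call once per training step, after optimizer.step()."""
        self.step_count += 1
        for i, p in enumerate(self.params):
            c = constraint_value(p)
            if self.step_count <= self.warm_start_steps:
                # Kappa-WS: keep overwriting kappa so the bound ends up at the
                # constraint value reached at the end of the warm-start phase.
                self.kappas[i] = c.clone()
                continue
            # Projected ascent on the multiplier: it grows while c(W) > kappa
            # and shrinks back toward zero once the constraint is satisfied.
            self.lambdas[i] = torch.clamp(
                self.lambdas[i] + self.mu * (c - self.kappas[i]), min=0.0
            )
            # Decoupled regularization pull: W <- W - lambda * grad c(W),
            # with grad c(W) = 2 W / numel(W) for the mean-squared statistic.
            p.add_(p, alpha=-(2.0 / p.numel()) * float(self.lambdas[i]))
```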
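
Tying this to the quoted CIFAR100/ResNet18 setup (500 warm-up steps, Kappa-WS set to twice the warm-up steps), a hypothetical training step could look as follows, reusing the CPRRegularizer sketch above. The model choice, learning rate, and the decision to regularize only tensors with more than one dimension are illustrative assumptions, not the authors' exact configuration.

```python
# Illustrative usage only; values mirror the quoted setup where stated.
import torch
import torch.nn.functional as F
from torchvision.models import resnet18

model = resnet18(num_classes=100)                        # CIFAR100 has 100 classes
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
lr_warmup_steps = 500                                    # as in the quoted setup
cpr = CPRRegularizer(
    [p for p in model.parameters() if p.dim() > 1],      # assumption: weight matrices/kernels only
    warm_start_steps=2 * lr_warmup_steps,                # Kappa-WS = 2x the warm-up steps
)

def training_step(images: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    optimizer.zero_grad()
    loss = F.cross_entropy(model(images), targets)
    loss.backward()
    optimizer.step()
    cpr.apply()   # CPR takes the place of decoupled weight decay after the step
    return loss
```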