Improving Deep Learning Optimization through Constrained Parameter Regularization
Authors: Jörg Franke, Michael Hefenbrock, Gregor Koehler, Frank Hutter
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our empirical studies on computer vision and language modeling tasks demonstrate CPR's effectiveness. The results show that CPR can outperform traditional weight decay and increase performance in pre-training and fine-tuning. |
| Researcher Affiliation | Collaboration | Jörg K.H. Franke, University of Freiburg, Germany; Michael Hefenbrock, RevoAI, Karlsruhe, Germany; Gregor Koehler, German Cancer Research Center (DKFZ), Heidelberg, Germany; Frank Hutter, ELLIS Institute Tübingen, Germany and University of Freiburg, Germany |
| Pseudocode | Yes | Algorithm 1: Optimization with constrained parameter regularization (CPR). Algorithm 2: Optimization with constrained parameter regularization (CPR) and Kappa-WS. Algorithm 3: Optimization with adaptive bound constrained parameter regularization (AdaCPR). |
| Open Source Code | Yes | Please find our implementation under https://github.com/automl/CPR. |
| Open Datasets | Yes | To evaluate CPR's effectiveness and design choices, we tested AdamW and Adam with CPR (AdamCPR) in image classification using a ResNet18 on the CIFAR100 dataset [25, 26]. |
| Dataset Splits | Yes | We found that setting the CPR warm start steps s to twice the warm-up steps is a good initial choice. For very low warm-up steps, the best s was four times the warm-up count. Conversely, with a long warm-up phase, a shorter CPR warm start (1×) is preferable. Notably, the optimal choice of s is almost independent of the learning rate, as shown in Figure E.3. The optimal warm start steps are consistent across a wide range of learning rates. |
| Hardware Specification | Yes | For example, training the small model on 4 A100 GPUs took 14.85h for AdamW and 14.89h for AdamCPR. The GPT2s and GPT2m models are trained on 8 A100 GPUs up to 28h. |
| Software Dependencies | No | The paper mentions using PyTorch and specific libraries like 'PyTorch Image Models library [31]' and 'flash attention [45]', but does not specify exact version numbers for these software dependencies. |
| Experiment Setup | Yes | We use a learning rate warm-up of 500 steps and the best Kappa-WS value is 2× the warm-up steps. We report the mean of three runs with random seeds. We provide details of the training and hyperparameters in Appendix H. |
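
The Pseudocode and Experiment Setup rows point to Algorithms 1–3 and the Kappa-WS warm start but quote no update equations. Below is a minimal sketch, assuming a squared-L2 regularization function per parameter tensor and a Lagrange-multiplier-style coefficient update; the class name `SimpleCPR`, the hyperparameter `mu`, and the exact update rule are illustrative assumptions, not the authors' reference implementation (see https://github.com/automl/CPR).

```python
# Hedged sketch of a CPR-style update with a Kappa-WS warm start.
# Everything below is an illustrative assumption based on the table above,
# not the authors' Algorithm 1/2; see https://github.com/automl/CPR.
import torch


class SimpleCPR:
    """Constrains the squared L2 norm of each parameter tensor to a bound kappa.

    Kappa-WS warm start: after `warm_start_steps` calls to step(), kappa is
    frozen at the parameter's current squared norm. From then on, a
    Lagrange-multiplier-like coefficient takes the role of a fixed
    weight-decay factor and is only active while the constraint is violated.
    """

    def __init__(self, params, warm_start_steps, mu=1.0):
        self.params = list(params)
        self.warm_start_steps = warm_start_steps
        self.mu = mu  # multiplier step size (assumed hyperparameter name)
        self.step_count = 0
        self.kappa = [None] * len(self.params)
        self.lagmul = [0.0] * len(self.params)

    @torch.no_grad()
    def step(self):
        self.step_count += 1
        for i, p in enumerate(self.params):
            reg = float((p ** 2).sum())  # regularization function R(p); squared L2 norm assumed
            if self.step_count == self.warm_start_steps:
                self.kappa[i] = reg  # Kappa-WS: set the bound from the current value
            if self.kappa[i] is None:
                continue  # constraint inactive until the warm start ends (assumption)
            # Multiplier update: grows while R(p) > kappa, decays back toward zero otherwise.
            self.lagmul[i] = max(self.lagmul[i] + self.mu * (reg - self.kappa[i]), 0.0)
            # Gradient of R(p) is 2p, so this acts like a per-tensor, adaptive weight decay.
            p.add_(p, alpha=-2.0 * self.lagmul[i])


# Illustrative usage: apply CPR after each Adam step, with the warm start set
# to twice a 500-step learning-rate warm-up, mirroring the setup quoted above.
model = torch.nn.Linear(128, 128)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
cpr = SimpleCPR(model.parameters(), warm_start_steps=2 * 500)

for _ in range(3):  # training loop stub
    loss = model(torch.randn(32, 128)).pow(2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    cpr.step()
```

In use, `cpr.step()` would follow each `optimizer.step()`, and `warm_start_steps` would be set to twice the learning-rate warm-up steps, matching the 2× heuristic reported in the Experiment Setup row; the training loop above is only a stub to show where the call sits.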