Scaling Laws for Hyperparameter Optimization
Authors: Arlind Kadra, Maciej Janowski, Martin Wistuba, Josif Grabocka
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We compare our method against 7 state-of-the-art competitors on 3 benchmarks related to tabular, image, and NLP datasets covering 59 diverse tasks. Our method achieves the best results across all benchmarks by obtaining the best any-time results compared to all competitors. |
| Researcher Affiliation | Collaboration | Arlind Kadra, Representation Learning Lab, University of Freiburg, kadraa@cs.uni-freiburg.de; Maciej Janowski, Representation Learning Lab, University of Freiburg, janowski@cs.uni-freiburg.de; Martin Wistuba, Amazon Web Services, Berlin, marwistu@amazon.com; Josif Grabocka, Representation Learning Lab, University of Freiburg, grabocka@cs.uni-freiburg.de |
| Pseudocode | Yes | Algorithm 1: Multi-Fidelity HPO with Deep Power Laws |
| Open Source Code | Yes | Our implementation of DPL is publicly available at https://github.com/releaunifreiburg/DPL |
| Open Datasets | Yes | LCBench: A benchmark that features 2,000 hyperparameter configurations that parametrize the architecture of simple feedforward neural networks, as well as the training pipeline [51]. PD1: A deep learning benchmark [45] that consists of recent DL architectures (including Transformers) run on large vision datasets such as CIFAR-10, CIFAR-100, and ImageNet, as well as statistical modeling corpora and protein sequence datasets from bioinformatics. TaskSet: A benchmark that features different optimization tasks evaluated in 5 different search spaces [34]. |
| Dataset Splits | Yes | Hyperparameter Optimization (HPO) demands finding the configurations λ ∈ Λ of a Machine Learning method that achieve the lowest validation loss L^(Val) of a model (e.g. a neural network), by discarding poorly-performing hyperparameter configurations after observing the validation error on the low-level fidelities of the optimization procedure [28, 9, 1, 29]. During the training phase, the batch size is set to 64, while for the validation phase, it is reduced to 8. |
| Hardware Specification | Yes | We ran experiments on a CPU cluster, where every node contains two Intel Xeon E5-2630v4 CPUs with 20 CPU cores running at 2.2 GHz. The total memory of every node is 120GB, and every experiment is limited to 2 cores which offer 12GB. Within these constraints, we focus our experiments on NVIDIA RTX 2080 GPUs. |
| Software Dependencies | Yes | We use version 0.7.4 of the HpBandSter library as a common codebase for all 3 baselines. |
| Experiment Setup | Yes | For our method, we use an ensemble of 5 models, where every model consists of a 2-layer feedforward neural network with 128 units per layer and LeakyReLU for the non-linearity. We use the L1 loss to train our network, coupled with Adam featuring an initial learning rate of 10^-3. For the first 10 iterations of our multi-fidelity HPO method in Algorithm 1, we train every network of our ensemble for 250 epochs with randomly sampled initial weights. The learning rate is initiated at 10^-6 and gradually increased over a span of five warmup epochs to reach the learning rate value of 5 × 10^-4. Following the warmup phase, we employ a cosine learning rate scheduler, with a decay factor of 0.97 applied every 10 epochs. The weight decay is set at 10^-5, with no momentum used. Furthermore, the dropout rate is configured to be 10^-6 and the model's moving average exponential decay is set at 0.9996. During the training phase, the batch size is set to 64, while for the validation phase, it is reduced to 8. (Illustrative sketches of the surrogate and of the learning-rate schedule follow below the table.) |
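
The Experiment Setup row describes the DPL surrogate as an ensemble of 5 two-layer feedforward networks with 128 units per layer and LeakyReLU, trained with an L1 loss and Adam at a learning rate of 10^-3 for 250 epochs. The sketch below is a minimal, non-authoritative reconstruction of one such surrogate in PyTorch; the power-law output head (alpha + beta · budget^(-gamma)), the softplus constraint on gamma, and the names `PowerLawSurrogate` and `train_surrogate` are assumptions for illustration, not the authors' code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class PowerLawSurrogate(nn.Module):
    """Sketch of a single DPL ensemble member (assumed power-law output head)."""

    def __init__(self, num_hyperparameters: int, hidden_units: int = 128):
        super().__init__()
        # 2-layer feedforward body with 128 units per layer and LeakyReLU.
        self.body = nn.Sequential(
            nn.Linear(num_hyperparameters, hidden_units),
            nn.LeakyReLU(),
            nn.Linear(hidden_units, hidden_units),
            nn.LeakyReLU(),
        )
        # Three heads for the assumed power-law parameters alpha, beta, gamma.
        self.param_head = nn.Linear(hidden_units, 3)

    def forward(self, config: torch.Tensor, budget: torch.Tensor) -> torch.Tensor:
        alpha, beta, gamma = self.param_head(self.body(config)).unbind(dim=-1)
        # Assumed power-law form of the predicted validation performance.
        return alpha + beta * budget.pow(-F.softplus(gamma))


def train_surrogate(model, configs, budgets, targets, epochs: int = 250):
    """Train one ensemble member for 250 epochs with L1 loss and Adam (lr 1e-3)."""
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    criterion = nn.L1Loss()
    for _ in range(epochs):
        optimizer.zero_grad()
        loss = criterion(model(configs, budgets), targets)
        loss.backward()
        optimizer.step()
    return model


# The paper uses an ensemble of 5 such networks with randomly sampled initial weights.
ensemble = [PowerLawSurrogate(num_hyperparameters=7) for _ in range(5)]
```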
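
The same row also quotes a training schedule: linear warmup from 10^-6 to 5 × 10^-4 over five epochs, followed by a cosine schedule with a 0.97 decay applied every 10 epochs. The helper below is a hedged sketch of one plausible reading of that schedule; treating the 0.97 factor as a multiplicative step decay layered on top of the cosine curve is an assumption, and the function name `learning_rate` is ours.

```python
import math


def learning_rate(epoch: int, total_epochs: int,
                  base_lr: float = 5e-4, warmup_start: float = 1e-6,
                  warmup_epochs: int = 5) -> float:
    """Learning rate at a given epoch: linear warmup, then cosine with step decay."""
    if epoch < warmup_epochs:
        # Linear warmup from warmup_start (1e-6) up to base_lr (5e-4).
        return warmup_start + (base_lr - warmup_start) * epoch / warmup_epochs
    # Cosine annealing over the remaining epochs.
    progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    # Assumed reading: an extra multiplicative decay of 0.97 every 10 epochs.
    step_decay = 0.97 ** ((epoch - warmup_epochs) // 10)
    return base_lr * cosine * step_decay
```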