Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Scaling Laws for Hyperparameter Optimization
Authors: Arlind Kadra, Maciej Janowski, Martin Wistuba, Josif Grabocka
NeurIPS 2023 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We compare our method against 7 state-of-the-art competitors on 3 benchmarks related to tabular, image, and NLP datasets covering 59 diverse tasks. Our method achieves the best results across all benchmarks by obtaining the best any-time results compared to all competitors. |
| Researcher Affiliation | Collaboration | Arlind Kadra, Representation Learning Lab, University of Freiburg; Maciej Janowski, Representation Learning Lab, University of Freiburg; Martin Wistuba, Amazon Web Services, Berlin; Josif Grabocka, Representation Learning Lab, University of Freiburg |
| Pseudocode | Yes | Algorithm 1: Multi-Fidelity HPO with Deep Power Laws (a hedged sketch of this loop appears after the table) |
| Open Source Code | Yes | Our implementation of DPL is publicly available at https://github.com/releaunifreiburg/DPL |
| Open Datasets | Yes | LCBench: A benchmark that features 2,000 hyperparameter configurations that parametrize the architecture of simple feedforward neural networks, as well as the training pipeline [51]. PD1: A deep learning benchmark [45] that consists of recent DL (including Transformers) architectures run on large vision datasets such as CIFAR-10, CIFAR-100, ImageNet, as well as statistical modeling corpora and protein sequence datasets from bioinformatics. TaskSet: A benchmark that features different optimization tasks evaluated in 5 different search spaces [34]. |
| Dataset Splits | Yes | Hyperparameter Optimization (HPO) demands finding the configurations λ ∈ Λ of a Machine Learning method that achieve the lowest validation loss L^(val) of a model (e.g. a neural network), by discarding poorly-performing hyperparameter configurations after observing the validation error on the low-level fidelities of the optimization procedure [28, 9, 1, 29]. During the training phase, the batch size is set to 64, while for the validation phase, it is reduced to 8. |
| Hardware Specification | Yes | We ran experiments on a CPU cluster, where every node contains two Intel Xeon E5-2630v4 CPUs with 20 CPU cores running at 2.2 GHz. The total memory of every node is 120GB, and every experiment is limited to 2 cores which offer 12GB. Within these constraints, we focus our experiments on NVIDIA RTX 2080 GPUs. |
| Software Dependencies | Yes | We use version 0.7.4 of the HpBandSter library as a common codebase for all 3 baselines. |
| Experiment Setup | Yes | For our method, we use an ensemble of 5 models, where every model consists of a 2-layer feedforward neural network with 128 units per layer and LeakyReLU for the non-linearity. We use the L1 loss to train our network, coupled with Adam featuring an initial learning rate of 10^-3. For the first 10 iterations of our multi-fidelity HPO method in Algorithm 1 we train every network of our ensemble for 250 epochs with randomly sampled initial weights. The learning rate is initiated at 10^-6 and gradually increased over a span of five warmup epochs to reach the learning rate value of 5×10^-4. Following the warmup phase, we employ a cosine learning rate scheduler, with a decay factor of 0.97 applied every 10 epochs. The weight decay is set at 10^-5, with no momentum used. Furthermore, the dropout rate is configured to be 10^-6 and the model's moving average exponential decay is set at 0.9996. During the training phase, the batch size is set to 64, while for the validation phase, it is reduced to 8. (A hedged training sketch based on this setup follows the table.) |
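The Experiment Setup row above pins down the surrogate precisely enough to sketch. Below is a minimal PyTorch sketch, not the authors' released code: the power law parametrization ŷ = α + β·b^(−γ) and the names `PowerLawSurrogate` and `train_ensemble` are illustrative assumptions, while the 2-layer feedforward network with 128 units and LeakyReLU, the L1 loss, Adam with learning rate 10^-3, 250 training epochs with randomly sampled initial weights, and the 5-model ensemble are taken from the quoted setup.

```python
import torch
import torch.nn as nn

class PowerLawSurrogate(nn.Module):
    """One ensemble member: a 2-layer feedforward network (128 units,
    LeakyReLU) mapping a hyperparameter configuration to the parameters
    (alpha, beta, gamma) of an assumed power law learning curve."""

    def __init__(self, config_dim: int, hidden_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(config_dim, hidden_dim), nn.LeakyReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.LeakyReLU(),
            nn.Linear(hidden_dim, 3),  # alpha, beta, gamma
        )

    def forward(self, config: torch.Tensor, budget: torch.Tensor) -> torch.Tensor:
        alpha, beta, gamma = self.net(config).unbind(dim=-1)
        # Assumed power law: predicted validation loss at the given budget.
        return alpha + beta * budget.pow(-gamma)

def train_ensemble(configs, budgets, losses, config_dim,
                   n_models=5, epochs=250, lr=1e-3):
    """Fit 5 surrogates with the L1 loss and Adam (lr 1e-3), each from
    freshly sampled initial weights, per the Experiment Setup row."""
    ensemble = []
    for _ in range(n_models):
        model = PowerLawSurrogate(config_dim)
        optimizer = torch.optim.Adam(model.parameters(), lr=lr)
        criterion = nn.L1Loss()
        for _ in range(epochs):
            optimizer.zero_grad()
            criterion(model(configs, budgets), losses).backward()
            optimizer.step()
        ensemble.append(model)
    return ensemble
```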
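The Pseudocode row names Algorithm 1, "Multi-Fidelity HPO with Deep Power Laws", but the report quotes only its title. The loop below is therefore a generic multi-fidelity sketch under stated assumptions, reusing `train_ensemble` from the sketch above: refit the ensemble on all observations, extrapolate every candidate to the maximum budget, and advance the most promising configuration by one fidelity step. The greedy argmin selection, the unit budget increment, and the hypothetical `evaluate(config, budget)` callback are not from the paper.

```python
def multi_fidelity_hpo(candidates, evaluate, config_dim, max_budget, hpo_iters):
    """Hedged sketch of multi-fidelity HPO with power law extrapolation.
    `candidates` is a list of configuration tensors; `evaluate` is a
    hypothetical callback returning the validation loss of a
    configuration trained up to the given budget."""
    history = []  # observed (config, budget, loss) triples
    budgets = [0] * len(candidates)
    for _ in range(hpo_iters):
        if history:
            cfgs = torch.stack([c for c, _, _ in history])
            buds = torch.tensor([b for _, b, _ in history], dtype=torch.float32)
            loss = torch.tensor([l for _, _, l in history], dtype=torch.float32)
            ensemble = train_ensemble(cfgs, buds, loss, config_dim)
            # Extrapolate all candidates to the maximum budget and average
            # the ensemble members' predictions.
            all_cfgs = torch.stack(candidates)
            full = torch.full((len(candidates),), float(max_budget))
            with torch.no_grad():
                preds = torch.stack([m(all_cfgs, full) for m in ensemble]).mean(0)
            best = int(preds.argmin())
        else:
            best = 0  # no observations yet: start with any candidate
        budgets[best] += 1  # advance the most promising config by one step
        history.append((candidates[best], budgets[best],
                        evaluate(candidates[best], budgets[best])))
    return min(history, key=lambda t: t[2])  # best observed configuration
```

This mirrors the multi-fidelity behavior quoted in the Dataset Splits row (poorly-performing configurations are implicitly discarded because they are simply never advanced), but the exact acquisition and stopping rules of Algorithm 1 should be taken from the paper.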