Scaling Laws for Hyperparameter Optimization
Authors: Arlind Kadra, Maciej Janowski, Martin Wistuba, Josif Grabocka
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We compare our method against 7 state-of-the-art competitors on 3 benchmarks related to tabular, image, and NLP datasets covering 59 diverse tasks. Our method achieves the best results across all benchmarks by obtaining the best any-time results compared to all competitors. |
| Researcher Affiliation | Collaboration | Arlind Kadra, Representation Learning Lab, University of Freiburg, kadraa@cs.uni-freiburg.de; Maciej Janowski, Representation Learning Lab, University of Freiburg, janowski@cs.uni-freiburg.de; Martin Wistuba, Amazon Web Services, Berlin, marwistu@amazon.com; Josif Grabocka, Representation Learning Lab, University of Freiburg, grabocka@cs.uni-freiburg.de |
| Pseudocode | Yes | Algorithm 1: Multi-Fidelity HPO with Deep Power Laws |
| Open Source Code | Yes | Our implementation of DPL is publicly available at https://github.com/releaunifreiburg/DPL |
| Open Datasets | Yes | LCBench: A benchmark that features 2,000 hyperparameter configurations that parametrize the architecture of simple feedforward neural networks, as well as the training pipeline [51]. PD1: A deep learning benchmark [45] that consists of recent DL architectures (including Transformers) run on large vision datasets such as CIFAR-10, CIFAR-100, and ImageNet, as well as statistical modeling corpora and protein sequence datasets from bioinformatics. TaskSet: A benchmark that features different optimization tasks evaluated in 5 different search spaces [34]. |
| Dataset Splits | Yes | Hyperparameter Optimization (HPO) demands finding the configurations λ ∈ Λ of a Machine Learning method that achieve the lowest validation loss L^(Val) of a model (e.g. a neural network), by discarding poorly-performing hyperparameter configurations after observing the validation error on the low-level fidelities of the optimization procedure [28, 9, 1, 29]. During the training phase, the batch size is set to 64, while for the validation phase, it is reduced to 8. |
| Hardware Specification | Yes | We ran experiments on a CPU cluster, where every node contains two Intel Xeon E5-2630v4 CPUs with 20 CPU cores running at 2.2 GHz. The total memory of every node is 120GB, and every experiment is limited to 2 cores which offer 12GB. Within these constraints, we focus our experiments on NVIDIA RTX 2080 GPUs. |
| Software Dependencies | Yes | We use version 0.7.4 of the HpBandSter library as a common codebase for all 3 baselines. |
| Experiment Setup | Yes | For our method, we use an ensemble of 5 models, where every model consists of a 2-layer feedforward neural network with 128 units per layer and LeakyReLU for the non-linearity. We use the L1 loss to train our network, coupled with Adam featuring an initial learning rate of 10^-3. For the first 10 iterations of our multi-fidelity HPO method in Algorithm 1, we train every network of our ensemble for 250 epochs with randomly sampled initial weights. The learning rate is initiated at 10^-6 and gradually increased over a span of five warmup epochs to reach the learning rate value of 5 × 10^-4. Following the warmup phase, we employ a cosine learning rate scheduler, with a decay factor of 0.97 applied every 10 epochs. The weight decay is set at 10^-5, with no momentum used. Furthermore, the dropout rate is configured to be 10^-6 and the model's moving average exponential decay is set at 0.9996. During the training phase, the batch size is set to 64, while for the validation phase, it is reduced to 8. (Illustrative sketches of the surrogate and of the learning-rate schedule follow below the table.) |
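
The Experiment Setup row describes the DPL surrogate as an ensemble of 5 two-layer feedforward networks with 128 units per layer and LeakyReLU, trained with an L1 loss and Adam at a learning rate of 10^-3 for 250 epochs. The sketch below is a minimal, non-authoritative reconstruction of one such surrogate in PyTorch; the power-law output head (alpha + beta · budget^(-gamma)), the softplus constraint on gamma, and the names `PowerLawSurrogate` and `train_surrogate` are assumptions for illustration, not the authors' code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class PowerLawSurrogate(nn.Module):
    """Sketch of a single DPL ensemble member (assumed power-law output head)."""

    def __init__(self, num_hyperparameters: int, hidden_units: int = 128):
        super().__init__()
        # 2-layer feedforward body with 128 units per layer and LeakyReLU.
        self.body = nn.Sequential(
            nn.Linear(num_hyperparameters, hidden_units),
            nn.LeakyReLU(),
            nn.Linear(hidden_units, hidden_units),
            nn.LeakyReLU(),
        )
        # Three heads for the assumed power-law parameters alpha, beta, gamma.
        self.param_head = nn.Linear(hidden_units, 3)

    def forward(self, config: torch.Tensor, budget: torch.Tensor) -> torch.Tensor:
        alpha, beta, gamma = self.param_head(self.body(config)).unbind(dim=-1)
        # Assumed power-law form of the predicted validation performance.
        return alpha + beta * budget.pow(-F.softplus(gamma))


def train_surrogate(model, configs, budgets, targets, epochs: int = 250):
    """Train one ensemble member for 250 epochs with L1 loss and Adam (lr 1e-3)."""
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    criterion = nn.L1Loss()
    for _ in range(epochs):
        optimizer.zero_grad()
        loss = criterion(model(configs, budgets), targets)
        loss.backward()
        optimizer.step()
    return model


# The paper uses an ensemble of 5 such networks with randomly sampled initial weights.
ensemble = [PowerLawSurrogate(num_hyperparameters=7) for _ in range(5)]
```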
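
The same row also quotes a training schedule: linear warmup from 10^-6 to 5 × 10^-4 over five epochs, followed by a cosine schedule with a 0.97 decay applied every 10 epochs. The helper below is a hedged sketch of one plausible reading of that schedule; treating the 0.97 factor as a multiplicative step decay layered on top of the cosine curve is an assumption, and the function name `learning_rate` is ours.

```python
import math


def learning_rate(epoch: int, total_epochs: int,
                  base_lr: float = 5e-4, warmup_start: float = 1e-6,
                  warmup_epochs: int = 5) -> float:
    """Learning rate at a given epoch: linear warmup, then cosine with step decay."""
    if epoch < warmup_epochs:
        # Linear warmup from warmup_start (1e-6) up to base_lr (5e-4).
        return warmup_start + (base_lr - warmup_start) * epoch / warmup_epochs
    # Cosine annealing over the remaining epochs.
    progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    # Assumed reading: an extra multiplicative decay of 0.97 every 10 epochs.
    step_decay = 0.97 ** ((epoch - warmup_epochs) // 10)
    return base_lr * cosine * step_decay
```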