In-Context Freeze-Thaw Bayesian Optimization for Hyperparameter Optimization

Authors: Herilalaina Rakotoarison, Steven Adriaensen, Neeratyoy Mallik, Samir Garibov, Eddie Bergman, Frank Hutter

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our empirical analysis across three benchmark suites shows that the predictions made by FT-PFN are more accurate and 10-100 times faster than those of the deep Gaussian process and deep ensemble surrogates used in previous work.
Researcher Affiliation | Academia | ¹Machine Learning Lab, University of Freiburg, Germany; ²ELLIS Institute Tübingen.
Pseudocode | Yes | Algorithm 1: Freeze-thaw Bayesian Optimization. (A rough sketch of this loop is given after the table.)
Open Source Code | Yes | The code for the surrogate PFN training and for reproducing the experiments from this paper is available at: https://github.com/automl/ifBO.
Open Datasets | Yes | We conduct our experiments on three benchmarks: LCBench (Zimmer et al., 2021), PD1 (Wang et al., 2021), and Taskset (Metz et al., 2020).
Dataset Splits | Yes | A single meta-training example in our setting corresponds to a training set $D_{\text{train}}$ and a test set $D_{\text{test}}$, where $D_{\text{train}} = \bigcup_{\lambda \in \Lambda} \{((\lambda, \tfrac{b}{b_{\max}}), \pi_{\text{curve}}(\lambda, \tfrac{b}{b_{\max}}))\}_{b=1}^{b_\lambda}$ corresponds to the (synthetic) partial learning curves observed thus far (i.e., the analog of $H$ at test time) and $D_{\text{test}} = \bigcup_{\lambda \in \Lambda} \{((\lambda, \tfrac{b}{b_{\max}}), \pi_{\text{curve}}(\lambda, \tfrac{b}{b_{\max}}))\}_{b=b_\lambda}^{b_{\max}}$ the extrapolation targets we want FT-PFN to predict. To keep the input size of FT-PFN fixed, we choose $\lvert D_{\text{train}}\rvert + \lvert D_{\text{test}}\rvert = N = 1{,}000$ and vary the size $\lvert D_{\text{train}}\rvert \sim U(0, N-1)$. (An illustrative sketch of this split is given after the table.)
Hardware Specification | Yes | The evaluation was run on a single Intel Xeon 6242 CPU. Training took roughly 8 GPU hours on an RTX 2080 GPU, and the same FT-PFN is used in all experiments described in Section 5, without any retraining/fine-tuning.
Software Dependencies | No | The paper mentions using a 'sequence Transformer', 'Adam optimizer', and 'cosine annealing', but does not provide specific version numbers for any software dependencies such as programming languages, libraries, or frameworks.
Experiment Setup | Yes | We use a standard training procedure for all experiments, minimizing the cross-entropy loss from Equation 1 on 2.0M synthetic datasets generated as described in Section A.2, using the Adam optimizer (Kingma et al., 2015) (learning rate 0.0001, batch size 25) with cosine annealing (Loshchilov & Hutter, 2017) and a linear warmup over the first 25% of training epochs. (A hedged sketch of this optimizer and schedule is given after the table.)
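
To make the pseudocode entry above concrete, the following is a minimal Python sketch of a generic freeze-thaw Bayesian optimization loop. It is an illustration under assumed placeholder interfaces (`surrogate`, `acquisition`, `config_space`, `evaluate`), not the authors' Algorithm 1 implementation.

```python
# Hypothetical sketch of a freeze-thaw BO loop (placeholder interfaces, not
# the authors' code): each iteration conditions the surrogate on all partial
# learning curves seen so far, then either thaws (continues) a paused
# configuration or starts a new one, as chosen by the acquisition function.
def freeze_thaw_bo(surrogate, acquisition, config_space, evaluate, total_budget, step=1):
    history = []   # observed (config, budget, performance) triples
    running = {}   # config -> budget spent so far ("frozen" runs)
    spent = 0
    while spent < total_budget:
        surrogate.fit(history)  # for an in-context surrogate this is conditioning, not retraining
        candidates = list(running) + [config_space.sample()]
        chosen = max(candidates, key=lambda c: acquisition(surrogate, c, running.get(c, 0)))
        budget = running.get(chosen, 0) + step
        value = evaluate(chosen, budget)   # train/continue `chosen` for `step` more units
        history.append((chosen, budget, value))
        running[chosen] = budget
        spent += step
    return history
```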
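
The dataset-split excerpt can likewise be illustrated in code. The NumPy sketch below assumes each synthetic curve contributes a prefix to the context set $D_{\text{train}}$ and its remaining points to the target set $D_{\text{test}}$, with the context size drawn uniformly from $\{0, \dots, N-1\}$; the per-curve cutoff policy shown here is an assumption for illustration, not the authors' exact sampling scheme.

```python
import numpy as np

def split_meta_example(curves, rng, n_total=1000):
    """Illustrative split of one synthetic meta-training example.

    `curves` is a list of (lam, [(b_over_bmax, value), ...]) pairs whose
    points sum to `n_total`. The context size |D_train| is drawn uniformly
    from {0, ..., n_total - 1}; each curve contributes a prefix to D_train
    and the rest to D_test (the extrapolation targets).
    """
    n_train = int(rng.integers(0, n_total))  # |D_train| ~ U(0, N - 1)
    remaining = n_train
    d_train, d_test = [], []
    for lam, curve in curves:
        cut = min(len(curve), remaining)     # b_lambda for this configuration
        remaining -= cut
        d_train += [((lam, t), y) for t, y in curve[:cut]]
        d_test += [((lam, t), y) for t, y in curve[cut:]]
    return d_train, d_test

# Example usage:
# rng = np.random.default_rng(0)
# d_train, d_test = split_meta_example(curves, rng)
```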
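
Finally, the experiment-setup excerpt (Adam with learning rate 0.0001, batch size 25, cosine annealing with a linear warmup over the first 25% of training) can be approximated as follows. The paper does not state the framework or versions (see the Software Dependencies row), so this sketch assumes PyTorch; `model` and `num_steps` are placeholders.

```python
import math
import torch

def make_optimizer_and_schedule(model, num_steps, lr=1e-4, warmup_frac=0.25):
    # Adam with the quoted learning rate; the PyTorch framework choice is an assumption.
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    warmup_steps = max(1, int(warmup_frac * num_steps))

    def lr_lambda(step):
        if step < warmup_steps:
            return step / warmup_steps                       # linear warmup over the first 25%
        progress = (step - warmup_steps) / max(1, num_steps - warmup_steps)
        return 0.5 * (1.0 + math.cos(math.pi * progress))    # cosine annealing for the remainder

    return optimizer, torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
```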