In-Context Freeze-Thaw Bayesian Optimization for Hyperparameter Optimization
Authors: Herilalaina Rakotoarison, Steven Adriaensen, Neeratyoy Mallik, Samir Garibov, Eddie Bergman, Frank Hutter
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our empirical analysis across three benchmark suites shows that the predictions made by FT-PFN are more accurate and 10-100 times faster than those of the deep Gaussian process and deep ensemble surrogates used in previous work. |
| Researcher Affiliation | Academia | Machine Learning Lab, University of Freiburg, Germany; ELLIS Institute Tübingen. |
| Pseudocode | Yes | Algorithm 1: Freeze-thaw Bayesian Optimization (a sketch of this loop follows the table). |
| Open Source Code | Yes | The code for the surrogate PFN training and for reproducing the experiments from this paper is available at: https://github.com/automl/ifBO. |
| Open Datasets | Yes | We conduct our experiments on three benchmarks: LCBench (Zimmer et al., 2021), PD1 (Wang et al., 2021), and Taskset (Metz et al., 2020). |
| Dataset Splits | Yes | A single meta-training example in our setting corresponds to a training set $D_{\text{train}}$ and test set $D_{\text{test}}$, where $D_{\text{train}} = \bigcup_{\lambda \in \Lambda} \{((\lambda, b/b_{\max}), \pi_{\text{curve}}(\lambda, b/b_{\max}))\}_{b=1}^{b_\lambda}$ corresponds to the (synthetic) partial learning curves observed thus far (i.e., the analog of $H$ at test time) and $D_{\text{test}} = \bigcup_{\lambda \in \Lambda} \{((\lambda, b/b_{\max}), \pi_{\text{curve}}(\lambda, b/b_{\max}))\}_{b=b_\lambda}^{b_{\max}}$ the extrapolation targets we want FT-PFN to predict. To keep the input size of FT-PFN fixed, we choose $\lvert D_{\text{train}} \rvert + \lvert D_{\text{test}} \rvert = N = 1{,}000$ and vary the size $\lvert D_{\text{train}} \rvert \sim U(0, N-1)$ (see the splitting sketch after the table). |
| Hardware Specification | Yes | The evaluation was run on a single Intel Xeon 6242 CPU. Training took roughly 8 GPU hours on an RTX2080 GPU and the same FT-PFN is used in all experiments described in Section 5, without any retraining/fine-tuning. |
| Software Dependencies | No | The paper mentions using a 'sequence Transformer', 'Adam optimizer', and 'cosine annealing', but does not provide specific version numbers for any software dependencies like programming languages, libraries, or frameworks. |
| Experiment Setup | Yes | We use a standard training procedure for all experiments, minimizing the cross-entropy loss from Equation 1 on 2.0M synthetic datasets generated as described in Section A.2, using the Adam optimizer (Kingma et al., 2015) (learning rate 0.0001, batch size 25) with cosine annealing (Loshchilov & Hutter, 2017) and a linear warmup over the first 25% of training (see the training sketch after the table). |
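
The pseudocode cell above refers to the paper's Algorithm 1 but does not reproduce it. As a rough, non-authoritative illustration, the Python sketch below shows the generic freeze-thaw loop such an algorithm follows: a surrogate conditioned on partial learning curves decides whether to continue a paused configuration or start a new one. The names `config_space`, `surrogate`, `acquisition`, and `run_one_step` are hypothetical placeholders, not the ifBO API.

```python
# Hedged sketch of a generic freeze-thaw Bayesian optimization loop.
# `config_space`, `surrogate`, `acquisition`, and `run_one_step` are
# hypothetical placeholders, not the actual ifBO implementation.

def freeze_thaw_bo(config_space, surrogate, acquisition, run_one_step, budget):
    history = []   # observed (config, step, value) triples, i.e. partial curves
    paused = {}    # "frozen" configurations -> number of steps trained so far

    for _ in range(budget):
        # Condition the surrogate on all partial learning curves seen so far.
        surrogate.fit(history)

        # Candidates: continue any paused configuration, or start a fresh one.
        candidates = list(paused) + [config_space.sample() for _ in range(10)]

        # Pick the candidate whose predicted curve continuation scores best
        # under the acquisition function.
        chosen = max(candidates,
                     key=lambda c: acquisition(surrogate, c, paused.get(c, 0)))

        # "Thaw" the chosen configuration: train it for one more unit of budget.
        step = paused.get(chosen, 0) + 1
        value = run_one_step(chosen, step)

        history.append((chosen, step, value))
        paused[chosen] = step  # freeze it again until it is chosen next

    return history
```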
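
The dataset-splits quote fixes the FT-PFN context at N = 1,000 points per meta-training example, splits each synthetic curve into an observed prefix and an extrapolation suffix, and draws the observed size uniformly. The sketch below shows one simple way to realize that split; `sample_task` is a hypothetical stand-in for the Section A.2 curve prior, and the greedy allocation of observed steps across configurations is a simplification, not the authors' exact procedure.

```python
# Hedged sketch of splitting one synthetic task into D_train / D_test for FT-PFN
# meta-training, with |D_train| + |D_test| = N = 1000 and |D_train| ~ U(0, N-1).
# `sample_task` is a hypothetical stand-in for the paper's curve prior (Section A.2).

import numpy as np

N = 1000  # fixed FT-PFN context size


def make_meta_training_example(sample_task, rng=np.random.default_rng()):
    # sample_task() returns, for each configuration lambda, its full learning curve
    # as a list of ((lambda, b / b_max), pi_curve) pairs for b = 1, ..., b_max,
    # with N points in total across all configurations.
    curves = sample_task(total_points=N)

    n_train = int(rng.integers(0, N))  # |D_train| ~ U(0, N-1)

    D_train, D_test = [], []
    remaining = n_train
    for curve in curves:
        # Observe a prefix of this curve (its first b_lambda steps); the rest of
        # the curve becomes extrapolation targets. The prefix budget is allocated
        # greedily here purely for illustration.
        b_lambda = min(len(curve), remaining) if remaining > 0 else 0
        D_train.extend(curve[:b_lambda])
        D_test.extend(curve[b_lambda:])
        remaining -= b_lambda

    return D_train, D_test
```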
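
The experiment-setup quote (Adam with learning rate 0.0001, batch size 25, cosine annealing with a linear warmup over the first 25% of training) maps onto standard PyTorch components. The sketch below is only an illustration of that optimizer and schedule under the stated hyperparameters, not the authors' training script; `model` and `train_loader` are placeholders, and scheduling is done per step here for simplicity.

```python
# Hedged sketch of the quoted setup: Adam (lr 1e-4, batch size 25 via the loader),
# cosine annealing, linear warmup over the first 25% of training steps.
# `model` and `train_loader` are placeholders for FT-PFN and the synthetic data.

import torch
from torch.optim import Adam
from torch.optim.lr_scheduler import LambdaLR, CosineAnnealingLR, SequentialLR


def build_optimizer_and_scheduler(model, total_steps, warmup_frac=0.25, lr=1e-4):
    optimizer = Adam(model.parameters(), lr=lr)
    warmup_steps = int(warmup_frac * total_steps)

    # Linear warmup from ~0 up to the base learning rate ...
    warmup = LambdaLR(optimizer, lr_lambda=lambda s: (s + 1) / max(1, warmup_steps))
    # ... then cosine annealing over the remaining steps.
    cosine = CosineAnnealingLR(optimizer, T_max=total_steps - warmup_steps)
    scheduler = SequentialLR(optimizer, schedulers=[warmup, cosine],
                             milestones=[warmup_steps])
    return optimizer, scheduler


def train(model, train_loader, total_steps, device="cpu"):
    optimizer, scheduler = build_optimizer_and_scheduler(model, total_steps)
    criterion = torch.nn.CrossEntropyLoss()  # cross-entropy over discretized targets
    model.train()
    step = 0
    for batch in train_loader:               # single pass shown for brevity
        inputs, targets = (t.to(device) for t in batch)
        optimizer.zero_grad()
        loss = criterion(model(inputs), targets)
        loss.backward()
        optimizer.step()
        scheduler.step()                      # per-step schedule update
        step += 1
        if step >= total_steps:
            break
```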