Towards Learning Universal Hyperparameter Optimizers with Transformers

Authors: Yutian Chen, Xingyou Song, Chansoo Lee, Zi Wang, Qiuyi Zhang, David Dohan, Kazuya Kawakami, Greg Kochanski, Arnaud Doucet, Marc'Aurelio Ranzato, Sagi Perel, Nando de Freitas

NeurIPS 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Our extensive experiments demonstrate that the OPTFORMER can simultaneously imitate at least 7 different HPO algorithms, which can be further improved via its function uncertainty estimates." "Extensive experiments on both public and private datasets demonstrate the OPTFORMER's competitive tuning and generalization abilities." "We evaluate mainly on the two natural HPO benchmarks, Real World Data and HPO-B." Section 6 is titled "Experiments".
Researcher Affiliation | Industry | "Yutian Chen¹, Xingyou Song², Chansoo Lee², Zi Wang², Qiuyi Zhang², David Dohan², Kazuya Kawakami¹, Greg Kochanski², Arnaud Doucet¹, Marc'Aurelio Ranzato¹, Sagi Perel², Nando de Freitas¹; ¹DeepMind, ²Google Research, Brain Team"
Pseudocode | No | The paper describes the model architecture and procedures (e.g., tokenization, inference), but it does not include any explicitly labeled "Pseudocode" or "Algorithm" blocks.
Open Source Code | Yes | Code: https://github.com/google-research/optformer.
Open Datasets | Yes | "The emergence of public machine learning data platforms such as OpenML [1] and hyperparameter optimization (HPO) services such as Google Vizier [2]... have made large-scale datasets containing hyperparameter evaluations accessible." "In addition, we create two new datasets based on public benchmarks. HPO-B is the largest public benchmark for HPO containing about 1.9K tuning tasks... [5]." "For further control over specific function dimensions and properties, we use the blackbox optimization benchmark BBOB [48]." The datasets generated on the public benchmarks, BBOB and HPO-B, can be reproduced by running publicly available HPO algorithms (see the data-generation sketch below the table).
Dataset Splits | No | "The train/test subsets of Real World Data are split temporally to avoid information leak (see Appendix C for details)." The paper does not explicitly specify a validation split (e.g., percentages or counts). A temporal-split sketch appears below the table.
Hardware Specification | Yes | "Our model is implemented in T5x [30] and trained on TPU-v4 chips." "The full model has 250M parameters and is trained for 2M steps with a batch size of 2048."
Software Dependencies | No | "We adopt the T5 Transformer encoder-decoder architecture [30]." "The shortened text string is then converted to a sequence of tokens via the SentencePiece tokenizer [44]." "Our model is implemented in T5x [30] and trained on TPU-v4 chips." The paper mentions software tools such as T5x and SentencePiece but does not provide specific version numbers for them. A tokenization sketch appears below the table.
Experiment Setup | Yes | "We train a single Transformer model with 250M parameters on the union of the three datasets described above, Real World Data, HPO-B, and BBOB (hyperparameter details in Appendix D.2)." "The full model has 250M parameters and is trained for 2M steps with a batch size of 2048." "We sample M = 100 candidate suggestions from π_prior." "For the historical sequence h, we convert every DOUBLE and INTEGER parameter along with every function value into a single token, by normalizing and discretizing them into integers, with a quantization level of Q = 1000." A quantization sketch appears below the table.
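The Open Datasets row notes that the BBOB and HPO-B training data can be regenerated by running publicly available HPO algorithms on public benchmarks. The sketch below illustrates that idea under simplifying assumptions: it uses the sphere function as a stand-in for the BBOB suite and random search as the HPO algorithm, so the function, search domain, and trial budget are illustrative rather than the paper's actual data pipeline.

```python
import numpy as np


def sphere(x: np.ndarray) -> float:
    """Sphere function, one of the simplest BBOB-style test functions."""
    return float(np.sum(x ** 2))


def random_search_trajectory(fn, dim: int, num_trials: int, seed: int = 0):
    """Run random search on `fn` and record the (parameters, value) trajectory.

    This mimics the idea of regenerating training data by running a public
    HPO algorithm on a public benchmark; the paper's datasets were produced
    with the specific algorithms and benchmark functions it describes.
    """
    rng = np.random.default_rng(seed)
    trajectory = []
    for _ in range(num_trials):
        x = rng.uniform(-5.0, 5.0, size=dim)  # BBOB domains are typically [-5, 5]^d
        trajectory.append((x, fn(x)))
    return trajectory


# Example: a 4-dimensional trajectory with 50 trials.
traj = random_search_trajectory(sphere, dim=4, num_trials=50)
```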
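For the Dataset Splits row, the paper states only that the Real World Data train/test subsets are split temporally (details deferred to its Appendix C). The following is a minimal sketch of such a split, assuming each study record carries a creation timestamp; the `created_at` field and the cutoff date are hypothetical and not the paper's schema.

```python
from typing import List, Tuple


def temporal_split(studies: List[dict], cutoff: str) -> Tuple[List[dict], List[dict]]:
    """Split studies by creation time so no test study predates a training one.

    `studies` is assumed to be a list of dicts with an ISO-formatted
    'created_at' field; both the field name and the cutoff are illustrative.
    """
    train = [s for s in studies if s["created_at"] < cutoff]
    test = [s for s in studies if s["created_at"] >= cutoff]
    return train, test


# Example with toy records.
studies = [
    {"name": "study_a", "created_at": "2019-03-01"},
    {"name": "study_b", "created_at": "2020-07-15"},
]
train, test = temporal_split(studies, cutoff="2020-01-01")
```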
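For the Software Dependencies row, the paper converts a shortened text representation of study metadata into tokens with the SentencePiece tokenizer. The sketch below shows how such a conversion could look with the open-source `sentencepiece` package; the model-file path and the example serialization string are placeholders, since the paper does not pin a vocabulary file or version.

```python
import sentencepiece as spm

# Path to a pretrained SentencePiece model; placeholder only, since the paper
# does not specify a particular vocabulary file or version.
sp = spm.SentencePieceProcessor(model_file="path/to/sentencepiece.model")

# A shortened text string describing study metadata, in the spirit of the
# paper's text serialization (the exact format is defined in the paper).
text = "title:cifar10 tuning,objective:accuracy,lr:DOUBLE[1e-5,1e-1]"

token_ids = sp.encode(text, out_type=int)  # integer token ids
pieces = sp.encode(text, out_type=str)     # subword pieces, for inspection
```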
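For the Experiment Setup row, every DOUBLE and INTEGER parameter and every function value is mapped to a single token by normalizing and discretizing it into an integer with quantization level Q = 1000. Below is a minimal sketch of that normalize-and-discretize step, assuming simple linear scaling over a known parameter range; the paper's exact handling of scaling (e.g., log-spaced parameters) and of function-value ranges is described there, not here.

```python
Q = 1000  # quantization level stated in the paper


def quantize(value: float, low: float, high: float, q: int = Q) -> int:
    """Normalize `value` to [0, 1] using its parameter range, then discretize
    it into one of `q` integer bins so it can be emitted as a single token.

    Linear scaling is an assumption of this sketch, not the paper's full rule.
    """
    normalized = (value - low) / (high - low)
    normalized = min(max(normalized, 0.0), 1.0)  # clip to the valid range
    return min(int(normalized * q), q - 1)


def dequantize(token: int, low: float, high: float, q: int = Q) -> float:
    """Map a token back to the center of its bin in the original range."""
    return low + (token + 0.5) / q * (high - low)


# Example: a learning rate of 3e-3 in the range [1e-5, 1e-1].
tok = quantize(3e-3, low=1e-5, high=1e-1)
approx = dequantize(tok, low=1e-5, high=1e-1)
```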