Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Large Language Models to Enhance Bayesian Optimization
Authors: Tennison Liu, Nicolás Astorga, Nabeel Seedat, Mihaela van der Schaar
ICLR 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically validate LLAMBO's efficacy on the problem of hyperparameter tuning, highlighting strong empirical performance across a range of diverse benchmarks, proprietary, and synthetic tasks. |
| Researcher Affiliation | Academia | Tennison Liu, Nicolás Astorga, Nabeel Seedat & Mihaela van der Schaar DAMTP, University of Cambridge Cambridge, UK EMAIL |
| Pseudocode | No | The paper describes methods in text and uses figures to illustrate concepts, but does not include structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | We provide the code to reproduce our results at https://github.com/tennisonliu/LLAMBO and the wider lab repository https://github.com/vanderschaarlab/LLAMBO. |
| Open Datasets | Yes | We conduct our investigations using 74 tasks extracted from Bayesmark and HPOBench [31, 32] and OpenAI's GPT-3.5 Language Model (see Appendix D for detailed experimental procedures). ... We utilize Bayesmark [31] as a continuous HPT benchmark. ... We included HPOBench, specifically the tabular benchmarks for computationally efficient evaluations [32]. ... Additionally, we introduce 3 proprietary (SEER [72], MAGGIC [73], and CUTRACT [74]) and 3 synthetic datasets. |
| Dataset Splits | No | The paper references standard benchmarks (Bayesmark, HPOBench) which usually have predefined splits, but does not explicitly state the training/validation/test split percentages or methodology for their own experiments beyond mentioning testing predictions against 'unseen points' or using '5 initialization points'. |
| Hardware Specification | Yes | For context, all runtime measurements were conducted on an Intel i7-1260P (a consumer-grade laptop). |
| Software Dependencies | Yes | We conduct our investigations using 74 tasks extracted from Bayesmark and HPOBench [31, 32] and OpenAI's GPT-3.5 Language Model (see Appendix D for detailed experimental procedures). ... For our experiments, we used gpt-3.5-turbo, version 0301 with default hyperparameters temperature = 0.7 and top_p = 0.95. ... SKOpt (GP-based) [68]: ... Version 0.9.0. GP (Deep Kernel Learning) [48]: ... (BoTorch version 0.8.5). SMAC3 [8]: ... Version 1.4.0. ... HEBO [77]: ... Version 0.3.5. Optuna [41]. ... Version 3.3.0. |
| Experiment Setup | Yes | Experimental setup. We conduct our investigations using 74 tasks extracted from Bayesmark and HPOBench [31, 32] and OpenAI's GPT-3.5 Language Model (see Appendix D for detailed experimental procedures). ... Each search begins with 5 initialization points and proceeds for 25 trials, and we report average results over ten seeded searches. ... For our instantiation of LLAMBO, we sample M = 20 candidate points, and set the exploration hyperparameter to α = 0.1. ... For our experiments, we used gpt-3.5-turbo, version 0301 with default hyperparameters temperature = 0.7 and top_p = 0.95. |
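
The budget quoted in the Experiment Setup row can be captured in a small configuration object, which may help when trying to replicate the protocol. The following is a minimal sketch, not the authors' code: the names `SearchConfig`, `toy_objective`, `run_search`, and `average_best` are hypothetical, the objective is a toy function, and candidate selection uses a uniform random pick as a stand-in for LLAMBO's LLM-based surrogate and acquisition step. It also assumes the 25 trials come in addition to the 5 initialization points, which the quoted text does not state explicitly.

```python
import random
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class SearchConfig:
    """Values quoted in the report; field names are hypothetical."""
    n_init: int = 5            # initialization points per search
    n_trials: int = 25         # trials after initialization (assumed)
    n_seeds: int = 10          # seeded searches to average over
    n_candidates: int = 20     # M, candidate points sampled per trial
    alpha: float = 0.1         # exploration hyperparameter (unused by the stand-in)
    model: str = "gpt-3.5-turbo"
    model_version: str = "0301"
    temperature: float = 0.7
    top_p: float = 0.95


def toy_objective(x: float) -> float:
    """Toy 1-D loss on [0, 1); a placeholder for a real HPT benchmark task."""
    return (x - 0.3) ** 2


def run_search(cfg: SearchConfig, seed: int) -> list[tuple[float, float]]:
    """Run one seeded search and return the (point, loss) history."""
    rng = random.Random(seed)
    history = []
    # Initialization phase: evaluate n_init random points.
    for _ in range(cfg.n_init):
        x = rng.random()
        history.append((x, toy_objective(x)))
    # Trial phase: sample M candidates, then pick one to evaluate.
    for _ in range(cfg.n_trials):
        candidates = [rng.random() for _ in range(cfg.n_candidates)]
        x = rng.choice(candidates)  # random stand-in for surrogate-guided selection
        history.append((x, toy_objective(x)))
    return history


def average_best(cfg: SearchConfig) -> float:
    """Average the best loss found across all seeded searches."""
    return mean(
        min(loss for _, loss in run_search(cfg, seed))
        for seed in range(cfg.n_seeds)
    )


if __name__ == "__main__":
    cfg = SearchConfig()
    print(f"average best loss over {cfg.n_seeds} seeds: {average_best(cfg):.6f}")
```

Fixing the seed per search makes each run repeatable, which is what allows the ten-seed average to be reported consistently. Swapping the random pick for a real surrogate would leave the budget structure unchanged.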