Towards AutoAI: Optimizing a Machine Learning System with Black-box and Differentiable Components
Authors: Zhiliang Chen, Chuan-Sheng Foo, Bryan Kian Hsiang Low
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We use A-BAD-BO to optimize several synthetic and real-world complex systems, including a prompt engineering pipeline for large language models containing millions of system parameters. Our results demonstrate that A-BAD-BO yields better system optimality than gradient-driven baselines and is more sample-efficient than pure BO algorithms. |
| Researcher Affiliation | Academia | 1Department of Computer Science, National University of Singapore, Singapore 2Institute for Infocomm Research, A*STAR, Singapore 3Centre for Frontier AI Research, A*STAR, Singapore. Correspondence to: Zhiliang Chen <chenzhiliang@u.nus.edu>. |
| Pseudocode | Yes | Algorithm 1 A-BAD-BO |
| Open Source Code | Yes | More details on each system can be found in App. D.1 and our code can be found at https://github.com/chenzhiliang94/A-BAD-BO. |
| Open Datasets | Yes | In MNIST system, we use the MNIST dataset (Deng, 2012) to train 2 ML components as binary digit classifiers. ... In Healthcare system, we use several healthcare models whose local datasets are open source healthcare related datasets; they include a body fat prediction model (Penrose et al., 1985), hepatitis risk prediction model (Hepatitis, 1988), kidney disease prediction model (Soundarapandian & Eswaran, 2015) and heart disease prediction model (Siddhartha, 2020). |
| Dataset Splits | No | The paper mentions 'fixed datasets' and 'system test dataset' (Sec. 2.2) and that local datasets are used for training (Sec. 2.2), but does not provide explicit training, validation, or test split percentages or sample counts for reproducibility. |
| Hardware Specification | No | The paper mentions 'ChatGPT' (an LLM) and 'DistilBERT' models, which are typically run on powerful hardware like GPUs, but it does not provide any specific details about the GPU models, CPU models, memory, or other hardware specifications used for their own experiments. |
| Software Dependencies | No | The paper mentions using 'PyTorch' (Sec. 3.6, App. D.2, Table 3), 'ChatGPT' (Sec. 1, D.1), and 'DistilBERT' (Sec. 6), but does not specify version numbers for any of these software dependencies or libraries, which is crucial for reproducibility. |
| Experiment Setup | Yes | We use a sampling size of k = 5 for optimizing the LLM system and k = 10 for other systems. For comparison fairness, we use the same number of system queries in all approaches and plot the best system loss achieved after each system query (Fig. 3). ... For comparison fairness, we plot the convergence of TuRBO after 200 iterations (since A-BAD-BO uses more system queries per BO iteration). |
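The Experiment Setup row describes plotting the best system loss achieved after each system query, which is the standard running-minimum convergence curve used to compare sample efficiency across optimizers. A minimal sketch of that bookkeeping, assuming a plain list of per-query losses (the function and variable names here are illustrative, not taken from the paper's code):

```python
# Running-minimum bookkeeping for a "best loss after each system query" plot.
# `losses` is a hypothetical trace of system losses, one entry per query.
from itertools import accumulate

def best_loss_curve(losses):
    """Return the best (lowest) system loss achieved after each query."""
    return list(accumulate(losses, min))

# Example: a noisy optimization trace over 6 system queries.
trace = [0.90, 0.75, 0.80, 0.60, 0.65, 0.55]
print(best_loss_curve(trace))  # [0.9, 0.75, 0.75, 0.6, 0.6, 0.55]
```

Because the curve is monotone non-increasing in the number of queries, plotting it against the query count (rather than BO iterations) gives the fair comparison the paper describes when different methods spend different numbers of system queries per iteration.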