Towards AutoAI: Optimizing a Machine Learning System with Black-box and Differentiable Components
Authors: Zhiliang Chen, Chuan-Sheng Foo, Bryan Kian Hsiang Low
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We use A-BAD-BO to optimize several synthetic and real-world complex systems, including a prompt engineering pipeline for large language models containing millions of system parameters. Our results demonstrate that A-BAD-BO yields better system optimality than gradient-driven baselines and is more sample-efficient than pure BO algorithms. |
| Researcher Affiliation | Academia | 1Department of Computer Science, National University of Singapore, Singapore 2Institute for Infocomm Research, A*STAR, Singapore 3Centre for Frontier AI Research, A*STAR, Singapore. Correspondence to: Zhiliang Chen <chenzhiliang@u.nus.edu>. |
| Pseudocode | Yes | Algorithm 1 A-BAD-BO |
| Open Source Code | Yes | More details on each system can be found in App. D.1 and our code can be found at https://github.com/chenzhiliang94/A-BAD-BO. |
| Open Datasets | Yes | In MNIST system, we use the MNIST dataset (Deng, 2012) to train 2 ML components as binary digit classifiers. ... In Healthcare system, we use several healthcare models whose local datasets are open source healthcare related datasets; they include a body fat prediction model (Penrose et al., 1985), hepatitis risk prediction model (Hepatitis, 1988), kidney disease prediction model (Soundarapandian & Eswaran, 2015) and heart disease prediction model (Siddhartha, 2020). |
| Dataset Splits | No | The paper mentions 'fixed datasets' and 'system test dataset' (Sec. 2.2) and that local datasets are used for training (Sec. 2.2), but does not provide explicit training, validation, or test split percentages or sample counts for reproducibility. |
| Hardware Specification | No | The paper mentions 'ChatGPT' (an LLM) and 'DistilBERT' models, which are typically run on powerful hardware like GPUs, but it does not provide any specific details about the GPU models, CPU models, memory, or other hardware specifications used for their own experiments. |
| Software Dependencies | No | The paper mentions using 'PyTorch' (Sec. 3.6, App. D.2, Table 3), 'ChatGPT' (Sec. 1, D.1), and 'DistilBERT' (Sec. 6), but does not specify version numbers for any of these software dependencies or libraries, which is crucial for reproducibility. |
| Experiment Setup | Yes | We use a sampling size of k = 5 for optimizing the LLM system and k = 10 for other systems. For comparison fairness, we use the same number of system queries in all approaches and plot the best system loss achieved after each system query (Fig. 3). ... For comparison fairness, we plot the convergence of TuRBO after 200 iterations (since A-BAD-BO uses more system queries per BO iteration). |
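The Experiment Setup row describes plotting the best system loss achieved after each system query, which is the standard running-minimum convergence curve used to compare sample efficiency across optimizers. A minimal sketch of that bookkeeping, assuming a plain list of per-query losses (the function and variable names here are illustrative, not taken from the paper's code):

```python
# Running-minimum bookkeeping for a "best loss after each system query" plot.
# `losses` is a hypothetical trace of system losses, one entry per query.
from itertools import accumulate

def best_loss_curve(losses):
    """Return the best (lowest) system loss achieved after each query."""
    return list(accumulate(losses, min))

# Example: a noisy optimization trace over 6 system queries.
trace = [0.90, 0.75, 0.80, 0.60, 0.65, 0.55]
print(best_loss_curve(trace))  # [0.9, 0.75, 0.75, 0.6, 0.6, 0.55]
```

Because the curve is monotone non-increasing in the number of queries, plotting it against the query count (rather than BO iterations) gives the fair comparison the paper describes when different methods spend different numbers of system queries per iteration.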