Gaussian Process Bandits with Aggregated Feedback

Authors: Mengyan Zhang, Russell Tsuchida, Cheng Soon Ong

AAAI 2022, pp. 9074-9081 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We illustrate GPOO and compare it with related algorithms on simulated data. In Section 5 we compare our algorithm with related algorithms in a simulated environment; our algorithm shows the best empirical performance in terms of aggregated regret. We investigate the empirical performance of GPOO on simulated data, compare the aggregated regret obtained by GPOO with that of related algorithms, and illustrate how different parameters influence performance. Regret curves of the different algorithms for two simulated reward functions are shown in Figure 4. (A hedged aggregated-regret sketch appears after the table.)
Researcher Affiliation | Collaboration | Mengyan Zhang (1,2), Russell Tsuchida (2), Cheng Soon Ong (1,2); 1: The Australian National University, 2: Data61, CSIRO
Pseudocode | Yes | Algorithm 1: GPOO
Open Source Code | Yes | Source code and Appendix are available at https://github.com/Mengyanz/GPOO
Open Datasets | No | The paper states: 'The reward functions are sampled from a GP posterior conditioned on hand-designed samples (listed in Appendix E)'. This does not provide concrete access information (link, DOI, or repository) for a publicly available or open dataset.
Dataset Splits | No | The paper uses simulated data and does not specify exact train/validation/test percentages, absolute sample counts for each split, or predefined splits with citations for data partitioning. It mentions 'We perform 30 independent runs with a budget N up to 80', but this describes experiment repetition rather than a data split.
Hardware Specification | No | The paper does not provide specific hardware details such as exact GPU/CPU models, processor types, or memory amounts used for running its experiments.
Software Dependencies | No | The paper does not provide ancillary software details with version numbers (e.g., library or solver names with versions such as Python 3.8 or CPLEX 12.4).
Experiment Setup | Yes | The reward functions are sampled from a GP posterior conditioned on hand-designed samples (listed in Appendix E), with a radial basis function (RBF) kernel having lengthscale 0.05 and variance 0.1. The GP noise standard deviation was set to 0.005. The reward noise is sampled i.i.d. from a zero-mean Gaussian distribution with standard deviation 0.1. For our experiment, we consider two cases of feedback: single-point feedback (S = 1), where the reward is sampled from the centre (representative point) of the selected cell, and average feedback (S = 10), where the reward is the average of samples from the centre of each sub-cell, obtained by splitting the cell into intervals of equal size. Following Corollary 1, we choose δ(h) = c·ρ^h, where c is chosen via cross-validation. The algorithms are evaluated by the aggregated regret in Definition 2 (S = 1 for single-point feedback, S = 10 for average feedback). The error probability needed for StoOO is chosen to be 0.1. We perform 30 independent runs with a budget N up to 80.
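
The following is a minimal, self-contained Python sketch of the simulated environment described in the Experiment Setup row, assuming a 1-D domain [0, 1]. It draws the reward function from a GP prior on a dense grid (the paper instead conditions a GP posterior on the hand-designed samples of Appendix E) and implements the two feedback modes; all function names, and the placeholder values for c and ρ, are illustrative and not taken from the released code.

    import numpy as np

    rng = np.random.default_rng(0)

    def rbf_kernel(x1, x2, lengthscale=0.05, variance=0.1):
        # RBF kernel with the lengthscale and variance reported in the paper.
        sq_dist = (x1[:, None] - x2[None, :]) ** 2
        return variance * np.exp(-0.5 * sq_dist / lengthscale ** 2)

    # Draw one reward function on a dense grid; interpolate between grid points.
    grid = np.linspace(0.0, 1.0, 500)
    K = rbf_kernel(grid, grid) + 1e-8 * np.eye(grid.size)  # jitter for stability
    f_values = np.linalg.cholesky(K) @ rng.standard_normal(grid.size)
    reward_fn = lambda x: np.interp(x, grid, f_values)

    REWARD_NOISE_STD = 0.1  # i.i.d. zero-mean Gaussian reward noise (paper: 0.1)

    def single_point_feedback(cell_lo, cell_hi):
        # S = 1: noisy reward at the centre (representative point) of the cell.
        centre = 0.5 * (cell_lo + cell_hi)
        return reward_fn(centre) + rng.normal(0.0, REWARD_NOISE_STD)

    def average_feedback(cell_lo, cell_hi, S=10):
        # S = 10: average of noisy rewards at the centres of S equal sub-cells.
        width = (cell_hi - cell_lo) / S
        centres = cell_lo + (np.arange(S) + 0.5) * width
        noisy = reward_fn(centres) + rng.normal(0.0, REWARD_NOISE_STD, size=S)
        return noisy.mean()

    # Per-depth confidence width from Corollary 1: delta(h) = c * rho ** h.
    # c and rho below are placeholders; the paper chooses c by cross-validation.
    delta = lambda h, c=1.0, rho=0.5: c * rho ** h

    # Example queries on the cell [0.25, 0.5]
    print(single_point_feedback(0.25, 0.5), average_feedback(0.25, 0.5))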
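
Building on the names defined in the sketch above, the snippet below gives one plausible reading of the aggregated regret (Definition 2) used to evaluate the algorithms; the precise definition is given in the paper.

    def aggregated_regret(cell_lo, cell_hi, S, f_star):
        # Gap between the optimum f* and the noise-free mean reward over the S
        # representative points of the recommended cell; S = 1 recovers the
        # usual simple regret at the cell centre.
        width = (cell_hi - cell_lo) / S
        centres = cell_lo + (np.arange(S) + 0.5) * width
        return f_star - reward_fn(centres).mean()

    f_star = f_values.max()  # optimum of the sampled reward function on the grid
    print(aggregated_regret(0.25, 0.5, S=10, f_star=f_star))
    print(aggregated_regret(0.25, 0.5, S=1, f_star=f_star))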