Quantum Multi-Armed Bandits and Stochastic Linear Bandits Enjoy Logarithmic Regrets
Authors: Zongqi Wan, Zhijie Zhang, Tongyang Li, Jialin Zhang, Xiaoming Sun
AAAI 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct experiments to demonstrate the performance of our two quantum variants of bandit algorithms. For simplicity, we use Bernoulli rewards in both bandit settings. |
| Researcher Affiliation | Academia | Institute of Computing Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences; Center for Applied Mathematics of Fujian Province, School of Mathematics and Statistics, Fuzhou University; Center on Frontiers of Computing Studies, Peking University; School of Computer Science, Peking University |
| Pseudocode | Yes | Algorithm 1: QUCB1(δ). Parameters: fail probability δ. 1: for i = 1, ..., n do; 2: r_i ← 1 and N_i ← ⌈C1 log(1/δ)/r_i⌉; 3: play arm i for N_i consecutive rounds; 4: run QMC1(O_i, r_i, δ) to get an estimate µ̂(i) of µ(i); 5: end for; 6: for each stage s = 1, 2, ... (terminate when T queries to the oracles O_i have been used) do; 7: let i_s ← argmax_i µ̂(i) + r_i (if the argmax has multiple choices, pick an arbitrary one); 8: update r_{i_s} ← r_{i_s}/2 and N_{i_s} ← ⌈C1 log(1/δ)/r_{i_s}⌉; 9: play i_s for the next N_{i_s} rounds and update µ̂(i_s) by running QMC1(O_{i_s}, r_{i_s}, δ); 10: end for. (A classical Python sketch of this loop follows the table.) |
| Open Source Code | No | The paper mentions using |
| Open Datasets | No | The paper describes setting up instances for experiments (e.g., |
| Dataset Splits | No | The paper does not specify any training, validation, or test dataset splits. It discusses |
| Hardware Specification | Yes | Our experiments are executed on a computer equipped with a Xeon E5-2620 CPU and 64GB memory. |
| Software Dependencies | No | The paper mentions |
| Experiment Setup | Yes | For the QMAB setting, we run UCB and QUCB on a 2-arm bandit for T = 10^6 rounds... we set the reward gap of our experimental instance to be relatively small. Overall, we set the mean reward of the optimal arm to 0.5 and the reward gap of the sub-optimal arm to 0.01. |
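
The pseudocode row above references the quantum mean estimator QMC1, which has no classical implementation. The following is a minimal sketch of the QUCB1 control flow only, assuming Bernoulli rewards: the sample average of the observed rewards stands in for QMC1, so the structure of the algorithm is reproduced but not its quantum query savings. The constant `C1`, the batch size `N_i = ceil(C1 * log(1/delta) / r_i)` (chosen to mimic QMC1's O((1/r) log(1/δ)) query complexity), and the function names `qucb1` and `play` are illustrative assumptions, not the authors' code.

```python
import math
import random

# A minimal classical sketch of the QUCB1 loop in Algorithm 1 (not the
# authors' code). ASSUMPTION: the quantum subroutine QMC1, an amplitude-
# estimation-based mean estimator, cannot run on classical hardware, so the
# sample average of the N_i observed rewards stands in for it here; the
# constant C1 and the batch size N_i = ceil(C1 * log(1/delta) / r_i) are
# placeholders that mimic QMC1's O((1/r) log(1/delta)) query complexity.

C1 = 2.0  # placeholder constant, not taken from the paper


def qucb1(means, T, delta, rng=random.Random(0)):
    """Run a QUCB1-style loop on a Bernoulli bandit with the given arm means."""
    n = len(means)
    pulls_left = T
    regret = 0.0
    best = max(means)

    def play(i, rounds):
        """Pull arm i for `rounds` plays (or until the budget T runs out)."""
        nonlocal pulls_left, regret
        rounds = min(rounds, pulls_left)
        rewards = [1.0 if rng.random() < means[i] else 0.0 for _ in range(rounds)]
        pulls_left -= rounds
        regret += (best - means[i]) * rounds
        return rewards

    # Initialization (lines 1-5): r_i <- 1, play each arm for N_i rounds,
    # and estimate its mean (classical stand-in for QMC1(O_i, r_i, delta)).
    r = [1.0] * n
    mu_hat = [0.0] * n
    for i in range(n):
        N_i = math.ceil(C1 * math.log(1.0 / delta) / r[i])
        rewards = play(i, N_i)
        if not rewards:
            return regret
        mu_hat[i] = sum(rewards) / len(rewards)

    # Stages (lines 6-10): pick the arm with the largest optimistic index
    # mu_hat(i) + r_i, halve its radius, and re-estimate its mean.
    while pulls_left > 0:
        i_s = max(range(n), key=lambda i: mu_hat[i] + r[i])
        r[i_s] /= 2.0
        N_is = math.ceil(C1 * math.log(1.0 / delta) / r[i_s])
        rewards = play(i_s, N_is)
        if rewards:
            mu_hat[i_s] = sum(rewards) / len(rewards)
    return regret
```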
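
Under the same assumptions, the experiment-setup row can be exercised with the sketch above on the 2-arm instance quoted there (optimal mean 0.5, gap 0.01, T = 10^6 rounds). The fail probability `delta` below is an assumed value; the paper's choice is not quoted in the table.

```python
# Hypothetical driver for the 2-arm Bernoulli instance described above.
if __name__ == "__main__":
    total_regret = qucb1(means=[0.5, 0.49], T=10**6, delta=1e-4)
    print(f"simulated pseudo-regret over 10^6 rounds: {total_regret:.1f}")
```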