Langevin Thompson Sampling with Logarithmic Communication: Bandits and Reinforcement Learning

Authors: Amin Karbasi, Nikki Lijing Kuang, Yian Ma, Siddharth Mitra

ICML 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We complement our theoretical findings with experimental results. Section 7, Experiments: In this section, we perform empirical studies in simulated environments for bandit and RL to corroborate our theoretical findings. By comparing the actual regret (average rewards) and the number of batches for interaction (maximum policy switches), we show Langevin TS algorithms empowered by LMC methods achieve appealing statistical accuracy with low communication cost. (A generic batched Thompson sampling sketch appears after this table.)
Researcher Affiliation | Academia | (1) Department of Electrical Engineering, Yale University, New Haven, USA; (2) Department of Computer Science, Yale University, New Haven, USA; (3) Department of Statistics and Data Science, Yale University, New Haven, USA; (4) Department of Computer Science and Engineering, University of California San Diego, La Jolla, USA; (5) Halıcıoğlu Data Science Institute, University of California San Diego, La Jolla, USA.
Pseudocode | Yes | Algorithm 1 SGLD with Batched Data ... Algorithm 2 Batched Langevin Thompson Sampling (BLTS) ... Algorithm 3 Langevin PSRL (LPSRL) ... Algorithm 4 Mirrored Langevin Dynamics (MLD). (An illustrative SGLD sketch appears after this table.)
Open Source Code | No | The paper does not contain any explicit statements or links indicating that the source code for the described methodology is publicly available.
Open Datasets | Yes | We simulate a Gaussian bandit environment with N = 15 arms. ... To demonstrate the applicability of Langevin TS in scenarios where posteriors are intractable, we construct a Laplace bandit environment with N = 10 arms. ... In the MDP setting, we consider a variant of the River Swim environment (Strehl and Littman, 2008), which is a common testbed for provable RL methods. (A sketch of the Gaussian bandit environment follows the table.)
Dataset Splits | No | The paper conducts experiments on simulated environments (Gaussian bandits, Laplace bandits, River Swim) rather than using pre-defined datasets with explicit training/validation/test splits. Performance is evaluated over a time horizon T.
Hardware Specification | No | The paper mentions running 'empirical studies in simulated environments' but does not provide any specific details about the hardware used for these simulations (e.g., CPU or GPU models, memory, or cloud instances).
Software Dependencies | No | The paper mentions using specific algorithms such as SGLD and MLD and notes that additional experimental details are in Appendix F. However, Appendix F describes environment parameters and prior settings but does not list specific versions of software libraries (e.g., PyTorch, TensorFlow) or the programming language used for implementation.
Experiment Setup | No | The paper provides theoretical specifications for algorithm parameters (e.g., 'with the SGLD parameters specified as per Algorithm 1 and with γ = O(1/dκ^3)' or 'setting the hyperparameters as per Theorem 1'). It also details environment parameters such as the number of arms, reward distributions, and prior settings (e.g., 'N = 15 arms', 'expected rewards ... evenly spaced in [1, 20]', 'Gaussian priors ... means evenly spaced in [14, 20], and inverted variance ... set to 0.375'). However, it does not provide concrete numerical values for the learning algorithm's operational hyperparameters (e.g., a precise learning rate for SGLD or an iteration count given only in O(·) notation) that would typically be needed to reproduce the experimental setup.
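The Open Datasets and Experiment Setup rows quote the Gaussian bandit configuration used in the paper's simulations: N = 15 arms, expected rewards evenly spaced in [1, 20], and Gaussian priors with means evenly spaced in [14, 20] and inverted variance 0.375. The following is a minimal sketch of such an environment; the class interface, unit reward noise, and seeding are illustrative assumptions rather than details taken from the paper.

```python
import numpy as np

class GaussianBandit:
    """Simulated Gaussian bandit matching the quoted configuration:
    N = 15 arms with expected rewards evenly spaced in [1, 20].
    The unit reward noise and the seed are illustrative assumptions."""

    def __init__(self, n_arms=15, reward_std=1.0, seed=0):
        self.rng = np.random.default_rng(seed)
        self.means = np.linspace(1.0, 20.0, n_arms)   # expected rewards in [1, 20]
        self.reward_std = reward_std

    def pull(self, arm):
        """Draw one noisy reward for the chosen arm."""
        return self.rng.normal(self.means[arm], self.reward_std)

# Gaussian priors as quoted in the Experiment Setup row:
# means evenly spaced in [14, 20], inverted variance (precision) 0.375 per arm.
prior_means = np.linspace(14.0, 20.0, 15)
prior_precisions = np.full(15, 0.375)
```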
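The Pseudocode row lists Algorithm 1 (SGLD with Batched Data). For orientation only, the sketch below shows a standard stochastic gradient Langevin dynamics update for drawing an approximate posterior sample of one arm's mean reward; the Gaussian likelihood, step size, iteration count, and minibatch size are assumptions made for illustration and do not reproduce the paper's exact Algorithm 1 or its O(·) parameter choices.

```python
import numpy as np

def sgld_posterior_sample(rewards, prior_mean, prior_precision,
                          noise_precision=1.0, step_size=1e-3,
                          n_iters=200, minibatch=32, rng=None):
    """Approximate posterior sample of one arm's mean reward via SGLD.
    Assumes a Gaussian likelihood; step size, iteration count, and
    minibatch size are placeholders, not the paper's tuned values."""
    rng = rng or np.random.default_rng()
    rewards = np.asarray(rewards, dtype=float)
    n = len(rewards)
    theta = float(prior_mean)                 # start from the prior mean
    for _ in range(n_iters):
        batch = rng.choice(rewards, size=min(minibatch, n), replace=False)
        # Stochastic gradient of the log posterior:
        #   grad log prior + (n / |batch|) * sum of grad log likelihood
        grad_prior = -prior_precision * (theta - prior_mean)
        grad_lik = noise_precision * np.sum(batch - theta) * (n / len(batch))
        grad = grad_prior + grad_lik
        # Langevin update: half-gradient step plus injected Gaussian noise
        theta += 0.5 * step_size * grad + np.sqrt(step_size) * rng.standard_normal()
    return theta
```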
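Finally, the Research Type row refers to comparing regret against the number of batches of interaction. The loop below is a generic batched Thompson sampling skeleton over the GaussianBandit sketch above, with exact conjugate Gaussian posterior updates standing in for the Langevin sampler and a doubling batch schedule standing in for the paper's communication schedule; both substitutions are assumptions made for exposition and are not claimed to be the paper's BLTS.

```python
import numpy as np

def batched_thompson_sampling(env, prior_means, prior_precisions,
                              horizon=10_000, noise_precision=1.0, seed=0):
    """Batched TS skeleton: one posterior sample per batch, the chosen arm
    is played for the whole batch, and batch lengths double, which keeps
    the number of batches logarithmic in the horizon."""
    rng = np.random.default_rng(seed)
    mu = prior_means.astype(float).copy()
    lam = prior_precisions.astype(float).copy()
    best_mean = env.means.max()
    regret, t, n_batches, batch_len = 0.0, 0, 0, 1
    while t < horizon:
        # Sample a mean for every arm from the current posterior,
        # then commit to the greedy arm for the entire batch.
        theta = rng.normal(mu, 1.0 / np.sqrt(lam))
        arm = int(np.argmax(theta))
        k = min(batch_len, horizon - t)
        rewards = [env.pull(arm) for _ in range(k)]
        # Conjugate Gaussian posterior update for the pulled arm.
        lam_new = lam[arm] + k * noise_precision
        mu[arm] = (lam[arm] * mu[arm] + noise_precision * sum(rewards)) / lam_new
        lam[arm] = lam_new
        regret += k * (best_mean - env.means[arm])
        t += k
        n_batches += 1
        batch_len *= 2  # doubling schedule: O(log T) batches
    return regret, n_batches

# Usage with the GaussianBandit sketch above:
# regret, n_batches = batched_thompson_sampling(GaussianBandit(), prior_means, prior_precisions)
```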