Information-Theoretic Confidence Bounds for Reinforcement Learning

Authors: Xiuyuan Lu, Benjamin Van Roy

NeurIPS 2019

Reproducibility Variable | Result | LLM Response
Research Type | Theoretical | We integrate information-theoretic concepts into the design and analysis of optimistic algorithms and Thompson sampling. By making a connection between information-theoretic quantities and confidence bounds, we obtain results that relate the per-period performance of the agent with its information gain about the environment, thus explicitly characterizing the exploration-exploitation tradeoff. The resulting cumulative regret bound depends on the agent's uncertainty over the environment and quantifies the value of prior information. We show applicability of this approach to several environments, including linear bandits, tabular MDPs, and factored MDPs. These examples demonstrate the potential of a general information-theoretic approach for the design and analysis of reinforcement learning algorithms.
Researcher Affiliation | Academia | Xiuyuan Lu (Stanford University, lxy@stanford.edu); Benjamin Van Roy (Stanford University, bvr@stanford.edu)
Pseudocode | Yes | Algorithm 1: Thompson Sampling
    1: Input: prior p
    2: for ℓ = 1, 2, . . . , L do
    3:     Sample: θ̂ℓ ∼ p
    4:     Act: Aℓ = arg max_{a ∈ A}
    5:     Observe: Y_{ℓ,Aℓ}
    6:     Update: p ← P(θ | p, Aℓ, Y_{ℓ,Aℓ})
    7: end for
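Algorithm 1 in the paper is stated in general form (sample a model from the posterior, act greedily with respect to the sample, then update the posterior). A minimal sketch of this loop for a Bernoulli bandit with conjugate Beta(1, 1) priors — the bandit setup, function name, and parameters here are illustrative assumptions, not the paper's experimental code:

```python
import random

def thompson_sampling(true_means, horizon, seed=0):
    """Minimal Thompson sampling sketch for a Bernoulli bandit (assumed setup).

    true_means: unknown per-arm success probabilities (used only to simulate rewards)
    horizon:    number of interaction periods L
    """
    rng = random.Random(seed)
    n_arms = len(true_means)
    # Beta posterior parameters per arm, starting from a uniform Beta(1, 1) prior.
    alpha = [1] * n_arms
    beta = [1] * n_arms
    total_reward = 0
    for _ in range(horizon):
        # Sample: draw a model theta_hat from the current posterior p.
        samples = [rng.betavariate(alpha[a], beta[a]) for a in range(n_arms)]
        # Act: choose the arm that is optimal under the sampled model.
        arm = max(range(n_arms), key=lambda a: samples[a])
        # Observe: a Bernoulli reward from the true (hidden) environment.
        reward = 1 if rng.random() < true_means[arm] else 0
        total_reward += reward
        # Update: conjugate Beta posterior update for the played arm.
        alpha[arm] += reward
        beta[arm] += 1 - reward
    return total_reward, alpha, beta
```

The Beta-Bernoulli pairing keeps the posterior update a one-line conjugate step; the paper's analysis applies to richer model classes (linear bandits, tabular and factored MDPs), where the sample and update steps are correspondingly more involved.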
Open Source Code | No | The paper does not provide any specific links to source code repositories or explicitly state that the code for the described methodology is publicly available.
Open Datasets | No | This paper is theoretical and focuses on analysis and derivation of bounds rather than empirical evaluation on specific datasets. Therefore, no concrete access information for a publicly available or open dataset is provided.
Dataset Splits | No | This is a theoretical paper that focuses on analytical bounds and framework development rather than empirical experiments. As such, no specific dataset split information (exact percentages, sample counts, or detailed splitting methodology) for training, validation, or testing is provided.
Hardware Specification | No | The paper does not provide any specific hardware details (e.g., GPU/CPU models, processor types, memory amounts, or detailed computer specifications) used for running experiments.
Software Dependencies | No | The paper does not provide any specific ancillary software details, such as library or solver names with version numbers, needed to replicate the experiment.
Experiment Setup | No | The paper describes theoretical frameworks and their applications to different problem types (linear bandits, tabular MDPs, factored MDPs) but does not include concrete hyperparameter values, training configurations, or system-level settings for a practical experimental setup.