Near-optimal Policy Identification in Active Reinforcement Learning

Authors: Xiang Li, Viraj Mehta, Johannes Kirschner, Ian Char, Willie Neiswanger, Jeff Schneider, Andreas Krause, Ilija Bogunovic

ICLR 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Experimentally, we demonstrate that AE-LSVI outperforms other RL algorithms in a variety of environments when robustness to the initial state is required." From Section 7 (Experiments): "In the previous sections we presented the AE-LSVI algorithm, which provably identifies a near-optimal policy in polynomial time given access to a generative model of the MDP dynamics. Here, we test the AE-LSVI algorithm empirically, and additionally provide one of the first empirical evaluations of the LSVI-UCB method from Yang et al. (2020) on standard benchmarks."
Researcher Affiliation | Academia | 1 ETH Zurich, 2 Carnegie Mellon University, 3 University of Alberta, 4 Stanford University, 5 University College London
Pseudocode | Yes | Algorithm 1: AE-LSVI (Active Exploration with Least-Squares Value Iteration); Algorithm 2: AE-LSVI for offline contextual Bayesian optimization (an illustrative, simplified sketch of the active-exploration idea appears after this table)
Open Source Code | Yes | "The supplementary material includes the source code for the experiments. It also includes a requirements file and README with full instructions on how to run the RL and BO experiments."
Open Datasets | No | "Although we are not allowed to provide the data used for running the β Tracking and β + Rotation experiments at this time, all other experiments can be run using the provided code."
Dataset Splits | No | The paper describes using initial state distributions p0 and p'0 to evaluate policies, but it does not specify explicit training, validation, and test splits (e.g., percentages or counts) for any dataset.
Hardware Specification | No | The paper does not specify the hardware used to run the experiments, such as GPU models, CPU types, or memory.
Software Dependencies | No | The paper mentions software such as the 'Tiny GP package', 'JAX', and 'Adam', but the main text does not provide version numbers for these or other key components; a requirements file is included with the supplementary material, but the paper itself does not list versions.
Experiment Setup | Yes | "All methods besides DDQN are initialized by executing a random policy for two episodes. The experiments are conducted with a default exploration bonus β = 0.5. We use an exact Gaussian Process with a squared exponential kernel with learned scale parameters. We fit the kernel hyperparameters at each iteration using 1000 iterations of Adam, maximizing the marginal log likelihood. We uniformly sample 1,000 points from the state space. For DDQN and BDQN, we use networks with two hidden layers, each with 256 units. BDQN uses 10 different heads."
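As a concrete illustration of the quoted setup, the following is a minimal sketch of fitting the scale parameters of a squared-exponential GP kernel by running 1000 Adam iterations on the negative marginal log likelihood. It assumes JAX and Optax (the paper mentions JAX and Adam, but its actual code ships with the supplementary material); the toy data, learning rate, noise level, and parameter names are illustrative assumptions, not details taken from the paper.

```python
# Illustrative sketch only: fit squared-exponential GP hyperparameters by
# maximizing the marginal log likelihood with Adam (JAX + Optax assumed).
# Toy data, learning rate, and noise handling are assumptions, not the paper's code.
import jax
import jax.numpy as jnp
import optax

def se_kernel(x1, x2, log_scale, log_amp):
    """Squared-exponential kernel with learned scale parameters."""
    d2 = jnp.sum((x1[:, None, :] - x2[None, :, :]) ** 2, axis=-1)
    return jnp.exp(log_amp) * jnp.exp(-0.5 * d2 / jnp.exp(2.0 * log_scale))

def neg_mll(params, X, y, noise=1e-3):
    """Negative Gaussian-process marginal log likelihood."""
    K = se_kernel(X, X, params["log_scale"], params["log_amp"])
    K = K + noise * jnp.eye(X.shape[0])
    L = jnp.linalg.cholesky(K)
    alpha = jax.scipy.linalg.cho_solve((L, True), y)
    return (0.5 * y @ alpha
            + jnp.sum(jnp.log(jnp.diag(L)))
            + 0.5 * y.shape[0] * jnp.log(2.0 * jnp.pi))

# Toy data standing in for the uniformly sampled state points and observed values.
key = jax.random.PRNGKey(0)
X = jax.random.uniform(key, (100, 2))
y = jnp.sin(3.0 * X[:, 0]) + 0.1 * jax.random.normal(key, (100,))

params = {"log_scale": jnp.zeros(()), "log_amp": jnp.zeros(())}
opt = optax.adam(1e-2)            # learning rate is an assumption
opt_state = opt.init(params)
loss_grad = jax.jit(jax.value_and_grad(neg_mll))

for _ in range(1000):             # "1000 iterations of Adam", per the setup above
    loss, grads = loss_grad(params, X, y)
    updates, opt_state = opt.update(grads, opt_state)
    params = optax.apply_updates(params, updates)
```

In practice the authors use the GP package mentioned under Software Dependencies; this sketch only spells out the "Adam on the marginal log likelihood" step described in the Experiment Setup row.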
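Regarding the Pseudocode row: the paper's Algorithm 1 is not reproduced here. Purely as an illustration of the active-exploration idea suggested by the algorithm's name and the robustness-to-initial-state claim quoted under Research Type, the following is a heavily simplified, hypothetical Python sketch that maintains optimistic and pessimistic value estimates and repeatedly queries a generative model from the initial state where their gap is largest. It substitutes a tabular count-based bonus for least-squares value iteration with features, and every quantity (toy MDP, bonus form, stopping threshold) is an assumption made for illustration only.

```python
# Hypothetical, heavily simplified illustration of the active-exploration idea:
# maintain optimistic (upper) and pessimistic (lower) value estimates and query
# the generative model from the initial state with the largest gap.
# This is NOT the paper's Algorithm 1; the tabular setting, bonus form, and
# stopping rule are assumptions for illustration only.
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, horizon = 5, 2, 3
beta = 0.5                                   # exploration bonus, as in the setup above
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # toy dynamics
R = rng.uniform(size=(n_states, n_actions))                       # toy rewards

counts = np.ones((n_states, n_actions))      # visit pseudo-counts for the bonus
r_sum = np.zeros((n_states, n_actions))
trans = np.ones((n_states, n_actions, n_states))  # pseudo-count transition model

def value_iteration(sign):
    """Finite-horizon value iteration with an optimistic (+) or pessimistic (-) bonus."""
    V = np.zeros(n_states)
    for _ in range(horizon):
        bonus = sign * beta / np.sqrt(counts)
        P_hat = trans / trans.sum(axis=-1, keepdims=True)
        Q = r_sum / counts + bonus + P_hat @ V
        V = np.clip(Q.max(axis=1), 0.0, horizon)
    return V, Q

for it in range(50):
    V_up, Q_up = value_iteration(+1.0)       # optimistic estimates
    V_lo, _ = value_iteration(-1.0)          # pessimistic estimates
    gap = V_up - V_lo
    if gap.max() < 0.1:                      # uncertainty small everywhere: stop
        break
    s = int(np.argmax(gap))                  # most uncertain initial state
    for _ in range(horizon):                 # roll out the optimistic greedy policy
        a = int(np.argmax(Q_up[s]))
        s_next = rng.choice(n_states, p=P[s, a])
        counts[s, a] += 1.0
        r_sum[s, a] += R[s, a]
        trans[s, a, s_next] += 1.0
        s = s_next

policy = Q_up.argmax(axis=1)                 # greedy policy from the optimistic Q
```

The point of the sketch is the selection rule `np.argmax(V_up - V_lo)`: data is gathered from the initial state where the value estimate is least certain, which is what makes the returned policy robust to the choice of initial state.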