Reinforcement Learning in Feature Space: Matrix Bandit, Kernels, and Regret Bound

Authors: Lin Yang, Mengdi Wang

ICML 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Theoretical | In this paper, we propose an online RL algorithm, namely MatrixRL, that leverages ideas from linear bandits to learn a low-dimensional representation of the probability transition model while carefully balancing the exploitation-exploration tradeoff. We show that MatrixRL achieves a regret bound $O(H^2 d \log T \sqrt{T})$, where $d$ is the number of features, independent of the number of state-action pairs. MatrixRL has an equivalent kernelized version, which is able to work with an arbitrary reproducing kernel Hilbert space without using explicit features. In this case, kernelized MatrixRL satisfies a regret bound $O(H^2 \widetilde{d} \log T \sqrt{T})$, where $\widetilde{d}$ is the effective dimension of the kernel space. (A minimal sketch of the algorithm's core update appears after this table.)
Researcher Affiliation | Academia | Department of Electrical and Computer Engineering, University of California, Los Angeles; Department of Electrical Engineering and Center for Statistics and Machine Learning, Princeton University.
Pseudocode | Yes | Algorithm 1 (Upper Confidence Matrix Reinforcement Learning, UC-MatrixRL) and Algorithm 2 (Kernel MatrixRL: Reinforcement Learning with Kernels).
Open Source Code | No | The paper does not provide an explicit statement or link for the availability of its source code.
Open Datasets | No | The paper is theoretical and does not mention using any specific publicly available datasets for training, nor does it provide access information for any dataset.
Dataset Splits | No | The paper is theoretical and does not describe training, validation, or test dataset splits for experimental reproduction.
Hardware Specification | No | The paper is theoretical and does not provide specific hardware details (e.g., GPU/CPU models, memory) used for running experiments.
Software Dependencies | No | The paper is theoretical and does not provide specific ancillary software details (e.g., library or solver names with version numbers) needed to replicate experiments.
Experiment Setup | No | The paper does not provide specific experimental setup details such as concrete hyperparameter values, training configurations, or system-level settings, as it is a theoretical work.
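
Since the authors release no code, the following minimal Python sketch may help readers see the mechanism behind UC-MatrixRL: under the paper's bilinear transition model P(s' | s, a) ≈ φ(s, a)ᵀ M* ψ(s'), the algorithm maintains a ridge-regression estimate of the core matrix M* and a linear-bandit-style elliptical confidence bonus. The class name, constructor parameters, and feature-map interface below are our assumptions for illustration, and the confidence width is simplified relative to the paper's theorem; this is not the authors' implementation.

```python
import numpy as np

class MatrixRLSketch:
    """Illustrative core-matrix estimator in the spirit of UC-MatrixRL.

    Assumes user-supplied feature maps phi(s, a) in R^d_phi and
    psi(s') in R^d_psi; names and interface are hypothetical.
    """

    def __init__(self, d_phi: int, d_psi: int, reg: float = 1.0, beta: float = 1.0):
        self.A = reg * np.eye(d_phi)        # regularized Gram matrix A_t = reg*I + sum phi phi^T
        self.B = np.zeros((d_phi, d_psi))   # cross moments: sum phi(s,a) psi(s')^T
        self.beta = beta                    # confidence width; theory suggests roughly H*sqrt(d log T)

    def update(self, phi_sa: np.ndarray, psi_next: np.ndarray) -> None:
        """Record one observed transition (s, a) -> s'."""
        self.A += np.outer(phi_sa, phi_sa)
        self.B += np.outer(phi_sa, psi_next)

    def estimate(self) -> np.ndarray:
        """Ridge-regression estimate M_t of the core matrix M*."""
        return np.linalg.solve(self.A, self.B)

    def bonus(self, phi_sa: np.ndarray) -> float:
        """Elliptical exploration bonus beta * ||phi||_{A_t^{-1}}, as in linear bandits."""
        return self.beta * float(np.sqrt(phi_sa @ np.linalg.solve(self.A, phi_sa)))
```

In an optimistic value-iteration loop, one would back up Q-values as roughly r(s, a) + φ(s, a)ᵀ M_t Ψᵀ v + bonus(φ(s, a)), truncated to [0, H]. The kernelized variant (Algorithm 2) replaces the explicit features with Gram matrices, which is why its regret depends on the effective dimension $\widetilde{d}$ of the kernel space rather than on $d$.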