Large Scale Markov Decision Processes with Changing Rewards

Authors: Adrian Rivera Cardoso, He Wang, Huan Xu

NeurIPS 2019

Reproducibility Variable | Result | LLM Response
Research Type | Theoretical | We consider Markov Decision Processes (MDPs) where the rewards are unknown and may change in an adversarial manner. We provide an algorithm that achieves a regret bound of O(√(τ(ln |S| + ln |A|) T ln T)), where S is the state space, A is the action space, τ is the mixing time of the MDP, and T is the number of periods. The algorithm's computational complexity is polynomial in |S| and |A|. We then consider a setting often encountered in practice, where the state space of the MDP is too large to allow for exact solutions. By approximating the state-action occupancy measures with a linear architecture of dimension d ≪ |S|, we propose a modified algorithm with a computational complexity polynomial in d and independent of |S|. We also prove a regret bound for this modified algorithm, which, to the best of our knowledge, is the first O(√T) regret bound in the large-scale MDP setting with adversarially changing rewards. (A LaTeX restatement of these bounds appears after this table.)
Researcher Affiliation | Collaboration | Adrian Rivera Cardoso and He Wang, School of Industrial and Systems Engineering, Georgia Institute of Technology (adrian.riv@gatech.edu, he.wang@isye.gatech.edu); Huan Xu, Alibaba Group (huan.xu@alibaba-inc.com)
Pseudocode | Yes | Algorithm 1 (MDP-RFTL). Input: parameters δ > 0, η > 0, regularization term R(µ) = Σ_{s∈S, a∈A} µ(s, a) ln µ(s, a). Initialization: choose any µ1 ∈ ∆M,δ ⊂ R^{|S|×|A|}. For t = 1, ..., T do ... Algorithm 2 (LARGE-MDP-RFTL). Input: matrix Φ, parameters η, δ > 0, convex function Rδ(µ), SGA step-size schedule {wt}_{t=0}^{T}, penalty-term parameters {Ht}_{t=1}^{T}. Initialize: θ1 ← PSGA(Rδ(Φθ) + V(θ), ΘΦ, w0, K0). For t = 1, ..., T do ... Algorithm 3 (Projected Stochastic Gradient Ascent, PSGA(f, X, w, K)). Input: concave objective function f, feasible set X, step size w, x1 ∈ X. For k = 1, ..., K do ... (A minimal Python sketch of PSGA appears after this table.)
Open Source Code | No | The paper does not provide any information about open-source code availability for the described methodology.
Open Datasets | No | The paper is theoretical and does not involve empirical evaluation on a dataset; therefore, it does not mention public datasets or training data.
Dataset Splits | No | The paper is theoretical and does not involve empirical evaluation with data; therefore, it does not discuss training, validation, or test splits.
Hardware Specification | No | The paper is theoretical and does not describe running experiments, hence no hardware specifications are mentioned.
Software Dependencies | No | The paper is theoretical and does not describe running experiments that would require specific software dependencies with version numbers.
Experiment Setup | No | The paper is theoretical and does not involve an experimental setup with hyperparameters or training settings.
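For readability, the two results quoted in the Research Type row can be restated as display math. The notation (S, A, τ, T, Φ, θ, d) follows the abstract; the hidden constants and the precise form of the large-scale bound are not reproduced here, so this is a restatement of the quoted rates only.

```latex
% Regret bound of MDP-RFTL (exact-solution setting), as quoted in the abstract:
%   S = state space, A = action space, tau = mixing time, T = number of periods.
\[
  \mathrm{Regret}(T) \;=\; O\!\left(\sqrt{\tau\,\bigl(\ln|S| + \ln|A|\bigr)\,T\,\ln T}\right)
\]

% Large-scale setting: state-action occupancy measures are restricted to a
% linear architecture of dimension d << |S|, i.e. to the span of a feature
% matrix Phi with coefficient vector theta,
\[
  \mu \;\approx\; \Phi\theta, \qquad \theta \in \mathbb{R}^{d}, \qquad d \ll |S|,
\]
% for which the modified algorithm (LARGE-MDP-RFTL) attains O(\sqrt{T}) regret.
```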
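The Pseudocode row also lists Algorithm 3, PSGA(f, X, w, K), a projected stochastic gradient ascent subroutine. The following is a minimal Python sketch of such a routine, assuming a Euclidean projection onto the feasible set; the callables grad_f and project and the averaged output are illustrative choices for this sketch, not the paper's exact specification.

```python
import numpy as np

def psga(grad_f, project, x1, w, K):
    """Minimal sketch of projected stochastic gradient ascent (Algorithm 3, PSGA).

    grad_f  : callable returning a (possibly stochastic) gradient of the concave
              objective f at a point x
    project : callable returning the Euclidean projection of a point onto the
              feasible set X
    x1      : feasible starting point (x1 in X)
    w       : step size
    K       : number of iterations
    """
    x = np.asarray(x1, dtype=float)
    iterates = [x]
    for _ in range(K):
        g = grad_f(x)               # stochastic gradient estimate at x
        x = project(x + w * g)      # ascent step, then project back onto X
        iterates.append(x)
    # Returning the average of the iterates is a standard output rule for
    # stochastic methods; the paper's exact output rule is not reproduced here.
    return np.mean(iterates, axis=0)
```

As quoted above, LARGE-MDP-RFTL invokes this subroutine over the parameter set ΘΦ, so a projection oracle for that set would play the role of project here.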