Large Scale Markov Decision Processes with Changing Rewards
Authors: Adrian Rivera Cardoso, He Wang, Huan Xu
NeurIPS 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Theoretical | We consider Markov Decision Processes (MDPs) where the rewards are unknown and may change in an adversarial manner. We provide an algorithm that achieves a regret bound of O(√(τ(ln\|S\| + ln\|A\|) T ln T)), where S is the state space, A is the action space, τ is the mixing time of the MDP, and T is the number of periods. The algorithm's computational complexity is polynomial in \|S\| and \|A\|. We then consider a setting often encountered in practice, where the state space of the MDP is too large to allow for exact solutions. By approximating the state-action occupancy measures with a linear architecture of dimension d ≪ \|S\|, we propose a modified algorithm with computational complexity polynomial in d and independent of \|S\| (see the toy sketch after the table). We also prove a regret bound for this modified algorithm, which, to the best of our knowledge, is the first O(√T) regret bound in the large-scale MDP setting with adversarially changing rewards. |
| Researcher Affiliation | Collaboration | Adrian Rivera Cardoso and He Wang — School of Industrial and Systems Engineering, Georgia Institute of Technology (adrian.riv@gatech.edu, he.wang@isye.gatech.edu); Huan Xu — Alibaba Group (huan.xu@alibaba-inc.com) |
| Pseudocode | Yes | Algorithm 1 (MDP-RFTL) — input: parameters δ > 0, η > 0, entropy regularizer R(µ) = Σ_{s∈S, a∈A} µ(s,a) ln µ(s,a); initialization: choose any µ₁ ∈ ∆_{M,δ} ⊆ ℝ^{\|S\|×\|A\|}; for t = 1, ..., T do ... Algorithm 2 (LARGE-MDP-RFTL) — input: matrix Φ, parameters η, δ > 0, convex function R_δ(µ), SGA step-size schedule {w_t}_{t=0}^{T}, penalty-term parameters {H_t}_{t=1}^{T}; initialize: θ₁ ← PSGA(−R_δ(Φθ) + V(θ), Θ_Φ, w₀, K₀); for t = 1, ..., T do ... Algorithm 3 (Projected Stochastic Gradient Ascent, PSGA(f, X, w, K)) — input: concave objective function f, feasible set X, step size w, initial point x₁ ∈ X; for k = 1, ..., K do ... (a minimal PSGA sketch follows the table). |
| Open Source Code | No | The paper does not provide any information about open-source code availability for the described methodology. |
| Open Datasets | No | The paper is theoretical and does not involve empirical evaluation on a dataset; therefore, it does not mention public datasets or training data. |
| Dataset Splits | No | The paper is theoretical and does not involve empirical evaluation with data; therefore, it does not discuss training, validation, or test splits. |
| Hardware Specification | No | The paper is theoretical and does not describe running experiments, hence no hardware specifications are mentioned. |
| Software Dependencies | No | The paper is theoretical and does not describe running experiments that would require specific software dependencies with version numbers. |
| Experiment Setup | No | The paper is theoretical and does not involve an experimental setup with hyperparameters or training settings. |
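To make the large-scale construction described in the abstract concrete, here is a toy sketch of the linear architecture µ ≈ Φθ, which replaces an \|S\|·\|A\|-dimensional occupancy measure with a d-dimensional parameter. This is not code from the paper: the sizes, the random feature matrix, and the normalization step are illustrative assumptions.

```python
import numpy as np

# Toy illustration (hypothetical sizes): the learner's expected reward in
# period t is the inner product <mu, r_t> between a state-action occupancy
# measure mu in R^{|S||A|} and the adversarially chosen reward vector r_t.
S, A, d = 1000, 10, 20            # |S|, |A|, and feature dimension d << |S|
rng = np.random.default_rng(0)

Phi = rng.random((S * A, d))      # feature matrix of the linear architecture
theta = rng.random(d)             # low-dimensional parameter

mu = Phi @ theta                  # approximate occupancy measure, mu ~ Phi @ theta
mu = np.maximum(mu, 1e-12)
mu /= mu.sum()                    # occupancy measures live on the simplex

r_t = rng.random(S * A)           # period-t reward vector, unknown in advance
expected_reward = mu @ r_t        # regret compares this against the best
                                  # stationary policy in hindsight
```

The point of the sketch is the dimension count: the algorithm only ever manipulates θ ∈ ℝ^d, so its per-period cost is polynomial in d rather than in \|S\|.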
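Algorithm 3 in the pseudocode row is standard projected stochastic gradient ascent, so it can be rendered faithfully. A minimal sketch, assuming the last iterate is returned (the paper may instead average iterates); the simplex projection used in the usage example is a common illustrative choice of feasible set X, not one specified in the table.

```python
import numpy as np

def psga(grad_f, project, x1, w, K):
    """PSGA(f, X, w, K): maximize a concave f over X (Algorithm 3 sketch).

    grad_f  -- returns a (stochastic) gradient of f at a point x
    project -- Euclidean projection onto the feasible set X
    x1      -- starting point in X; w -- step size; K -- iteration count
    """
    x = x1
    for _ in range(K):
        # gradient *ascent* step on the concave objective, then project back
        x = project(x + w * grad_f(x))
    return x

def project_simplex(v):
    """Euclidean projection onto the probability simplex (illustrative X)."""
    u = np.sort(v)[::-1]                       # sort descending
    css = np.cumsum(u)
    idx = np.arange(1, len(v) + 1)
    rho = np.nonzero(u + (1.0 - css) / idx > 0)[0][-1]
    lam = (1.0 - css[rho]) / (rho + 1.0)
    return np.maximum(v + lam, 0.0)

# usage: maximize the concave f(x) = -||x - c||^2 over the simplex
c = np.array([0.7, 0.2, 0.4])
x_star = psga(lambda x: -2.0 * (x - c), project_simplex,
              x1=np.ones(3) / 3, w=0.1, K=500)
```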