Large Scale Markov Decision Processes with Changing Rewards

Authors: Adrian Rivera Cardoso, He Wang, Huan Xu

NeurIPS 2019

Reproducibility Variable | Result | LLM Response
Research Type | Theoretical | We consider Markov Decision Processes (MDPs) where the rewards are unknown and may change in an adversarial manner. We provide an algorithm that achieves a regret bound of O(√(τ(ln |S| + ln |A|) T ln T)), where S is the state space, A is the action space, τ is the mixing time of the MDP, and T is the number of periods. The algorithm's computational complexity is polynomial in |S| and |A|. We then consider a setting often encountered in practice, where the state space of the MDP is too large to allow for exact solutions. By approximating the state-action occupancy measures with a linear architecture of dimension d ≪ |S|, we propose a modified algorithm with a computational complexity polynomial in d and independent of |S|. We also prove a regret bound for this modified algorithm, which, to the best of our knowledge, is the first O(√T) regret bound in the large-scale MDP setting with adversarially changing rewards. (A LaTeX restatement of these bounds appears after this table.)
Researcher Affiliation | Collaboration | Adrian Rivera Cardoso and He Wang, School of Industrial and Systems Engineering, Georgia Institute of Technology (adrian.riv@gatech.edu, he.wang@isye.gatech.edu); Huan Xu, Alibaba Group (huan.xu@alibaba-inc.com)
Pseudocode | Yes | Algorithm 1 (MDP-RFTL). Input: parameters δ > 0, η > 0, regularization term R(µ) = Σ_{s∈S, a∈A} µ(s, a) ln µ(s, a). Initialization: choose any µ1 ∈ ∆M,δ ⊂ R^{|S|×|A|}. For t = 1, ..., T do ... Algorithm 2 (LARGE-MDP-RFTL). Input: matrix Φ, parameters η, δ > 0, convex function Rδ(µ), SGA step-size schedule {wt}_{t=0}^{T}, penalty-term parameters {Ht}_{t=1}^{T}. Initialize: θ1 ← PSGA(Rδ(Φθ) + V(θ), ΘΦ, w0, K0). For t = 1, ..., T do ... Algorithm 3 (Projected Stochastic Gradient Ascent, PSGA(f, X, w, K)). Input: concave objective function f, feasible set X, step size w, x1 ∈ X. For k = 1, ..., K do ... (A minimal Python sketch of PSGA appears after this table.)
Open Source Code | No | The paper does not provide any information about open-source code availability for the described methodology.
Open Datasets | No | The paper is theoretical and does not involve empirical evaluation on a dataset; therefore, it does not mention public datasets or training data.
Dataset Splits | No | The paper is theoretical and does not involve empirical evaluation with data; therefore, it does not discuss training, validation, or test splits.
Hardware Specification | No | The paper is theoretical and does not describe running experiments, hence no hardware specifications are mentioned.
Software Dependencies | No | The paper is theoretical and does not describe running experiments that would require specific software dependencies with version numbers.
Experiment Setup | No | The paper is theoretical and does not involve an experimental setup with hyperparameters or training settings.
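For readability, the two results quoted in the Research Type row can be restated as display math. The notation (S, A, τ, T, Φ, θ, d) follows the abstract; the hidden constants and the precise form of the large-scale bound are not reproduced here, so this is a restatement of the quoted rates only.

```latex
% Regret bound of MDP-RFTL (exact-solution setting), as quoted in the abstract:
%   S = state space, A = action space, tau = mixing time, T = number of periods.
\[
  \mathrm{Regret}(T) \;=\; O\!\left(\sqrt{\tau\,\bigl(\ln|S| + \ln|A|\bigr)\,T\,\ln T}\right)
\]

% Large-scale setting: state-action occupancy measures are restricted to a
% linear architecture of dimension d << |S|, i.e. to the span of a feature
% matrix Phi with coefficient vector theta,
\[
  \mu \;\approx\; \Phi\theta, \qquad \theta \in \mathbb{R}^{d}, \qquad d \ll |S|,
\]
% for which the modified algorithm (LARGE-MDP-RFTL) attains O(\sqrt{T}) regret.
```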
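The Pseudocode row also lists Algorithm 3, PSGA(f, X, w, K), a projected stochastic gradient ascent subroutine. The following is a minimal Python sketch of such a routine, assuming a Euclidean projection onto the feasible set; the callables grad_f and project and the averaged output are illustrative choices for this sketch, not the paper's exact specification.

```python
import numpy as np

def psga(grad_f, project, x1, w, K):
    """Minimal sketch of projected stochastic gradient ascent (Algorithm 3, PSGA).

    grad_f  : callable returning a (possibly stochastic) gradient of the concave
              objective f at a point x
    project : callable returning the Euclidean projection of a point onto the
              feasible set X
    x1      : feasible starting point (x1 in X)
    w       : step size
    K       : number of iterations
    """
    x = np.asarray(x1, dtype=float)
    iterates = [x]
    for _ in range(K):
        g = grad_f(x)               # stochastic gradient estimate at x
        x = project(x + w * g)      # ascent step, then project back onto X
        iterates.append(x)
    # Returning the average of the iterates is a standard output rule for
    # stochastic methods; the paper's exact output rule is not reproduced here.
    return np.mean(iterates, axis=0)
```

As quoted above, LARGE-MDP-RFTL invokes this subroutine over the parameter set ΘΦ, so a projection oracle for that set would play the role of project here.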