Model-free Reinforcement Learning in Infinite-horizon Average-reward Markov Decision Processes

Authors: Chen-Yu Wei, Mehdi Jafarnia-Jahromi, Haipeng Luo, Hiteshi Sharma, Rahul Jain

ICML 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this paper, two model-free algorithms are introduced for learning infinite-horizon average-reward Markov Decision Processes (MDPs). The first algorithm reduces the problem to the discounted-reward version and achieves O(T^{2/3}) regret after T steps, under the minimal assumption of weakly communicating MDPs. To our knowledge, this is the first model-free algorithm for general MDPs in this setting. The second algorithm makes use of recent advances in adaptive algorithms for adversarial multi-armed bandits and improves the regret to O(√T), albeit with a stronger ergodic assumption. This result significantly improves over the O(T^{3/4}) regret achieved by the only existing model-free algorithm by Abbasi-Yadkori et al. (2019a) for ergodic MDPs in the infinite-horizon average-reward setting. ... We also conduct experiments comparing our two algorithms. Details are deferred to Appendix D due to space constraints.
Researcher Affiliation | Academia | Chen-Yu Wei¹, Mehdi Jafarnia-Jahromi¹, Haipeng Luo¹, Hiteshi Sharma¹, Rahul Jain¹ (¹University of Southern California). Correspondence to: Chen-Yu Wei <chenyu.wei@usc.edu>, Mehdi Jafarnia-Jahromi <mjafarni@usc.edu>.
Pseudocode | Yes | Algorithm 1 OPTIMISTIC Q-LEARNING; Algorithm 2 MDP-OOMD; Algorithm 3 ESTIMATEQ; Algorithm 4 OOMDUPDATE. (A hedged sketch of Algorithm 1 appears after this table.)
Open Source Code | No | The paper does not provide an explicit statement or link for the source code of its described methodology. While Appendix D is mentioned for experimental details, it does not state that code is provided there or elsewhere.
Open Datasets | No | The paper mentions conducting experiments, stating 'Details are deferred to Appendix D due to space constraints.' However, the main body of the paper does not specify any datasets used or provide concrete access information for them.
Dataset Splits | No | The paper mentions conducting experiments in Appendix D, but the main text does not provide specific details on training, validation, or test dataset splits.
Hardware Specification | No | The paper does not provide any specific hardware details (e.g., GPU/CPU models, memory amounts) used for running its experiments.
Software Dependencies | No | The paper does not provide specific ancillary software details with version numbers (e.g., library or solver names with version numbers) needed to replicate the experiment.
Experiment Setup | No | The paper mentions conducting experiments in Appendix D, but the main text does not provide specific experimental setup details such as hyperparameter values, training configurations, or system-level settings.
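To make the first algorithm concrete, below is a minimal Python sketch of an optimistic Q-learning loop in the spirit of Algorithm 1 (OPTIMISTIC Q-LEARNING): the average-reward problem is reduced to a discounted one with gamma = 1 - 1/H, Q-values are initialized optimistically, and a count-based bonus drives exploration. The environment interface (reset, step, n_states, n_actions) and the learning-rate and bonus constants are illustrative assumptions, not the paper's exact specification.

    import numpy as np

    def optimistic_q_learning(env, T, H, c=1.0):
        # Hedged sketch in the spirit of Algorithm 1 (OPTIMISTIC Q-LEARNING).
        # Assumed (hypothetical) environment interface:
        #   env.reset() -> initial state index
        #   env.step(a) -> (next_state, reward), with rewards in [0, 1]
        #   env.n_states, env.n_actions -> sizes of the state and action spaces
        S, A = env.n_states, env.n_actions
        gamma = 1.0 - 1.0 / H                 # reduction to a discounted problem
        Q = np.full((S, A), float(H))         # optimistic initialization (H = 1 / (1 - gamma))
        V = np.full(S, float(H))
        counts = np.zeros((S, A), dtype=int)  # visit counts n(s, a)
        total_reward = 0.0

        s = env.reset()
        for _ in range(T):
            a = int(np.argmax(Q[s]))          # act greedily w.r.t. the optimistic Q
            s_next, r = env.step(a)
            total_reward += r

            counts[s, a] += 1
            tau = counts[s, a]
            alpha = (H + 1.0) / (H + tau)     # step size decaying with the visit count
            bonus = c * np.sqrt(H / tau)      # count-based exploration bonus; c is a placeholder constant

            Q[s, a] = (1 - alpha) * Q[s, a] + alpha * (r + gamma * V[s_next] + bonus)
            V[s] = min(float(H), Q[s].max())  # keep the value estimate bounded by H
            s = s_next

        return total_reward / T               # empirical average reward over T steps

In the paper the horizon parameter H is tuned against T so that the resulting regret scales as O(T^{2/3}); in this sketch H is simply passed in by the caller.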