Learning and Planning in Average-Reward Markov Decision Processes

Authors: Yi Wan, Abhishek Naik, Richard S. Sutton

ICML 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In experiments with an Access-Control Queuing Task, we show some of the difficulties that can arise when using methods that rely on reference states and argue that our new algorithms can be significantly easier to use.
Researcher Affiliation | Collaboration | 1University of Alberta and Alberta Machine Intelligence Institute (Amii), Edmonton, Canada. 2DeepMind.
Pseudocode | Yes | Pseudocode for both algorithms is provided in Appendix A.
Open Source Code | No | The paper does not provide any links or explicit statements about the public availability of its source code.
Open Datasets | Yes | In this section we present empirical results with both Differential Q-learning and RVI Q-learning algorithms on the Access-Control Queuing task (Sutton & Barto, 2018). In this section we present empirical results with average-reward prediction learning algorithms using the Two Loop task shown in the upper right of Figure 3 (cf. Mahadevan, 1996; Naik et al., 2019).
Dataset Splits | No | The paper describes experiments run for a certain number of steps (e.g., "30 runs of 80,000 steps") and parameter studies, but does not specify explicit training/validation/test dataset splits as commonly found in supervised learning tasks. Data is generated through interaction with an environment.
Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., CPU, GPU models, memory) used for running the experiments.
Software Dependencies | No | The paper does not specify the version numbers of any software libraries, programming languages, or solvers used in the experiments.
Experiment Setup | Yes | Differential Q-learning was run with a range of η values, and RVI Q-learning was run with three kinds of reference functions suggested by Abounadi et al. (2001): (1) the value of a single reference state-action pair, for which we considered all possible 88 state-action pairs, (2) the maximum value of the action-value estimates, and (3) the mean of the action-value estimates. Both algorithms used an ϵ-greedy behavior policy with ϵ = 0.1. The policy π to be evaluated was the one that randomly picks left or right in state 0 with probability 0.5. Data was collected with a behavior policy that picks the left and right actions with probabilities 0.9 and 0.1, respectively.
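
For concreteness, below is a minimal tabular sketch of the two control algorithms referenced in the Experiment Setup row, following their standard published update rules (Differential Q-learning from Wan, Naik & Sutton, 2021; RVI Q-learning from Abounadi et al., 2001). The environment interface (env.reset(), env.step()), the default step sizes, the step count, and the ref_s/ref_a names in the comments are illustrative assumptions, not the authors' actual code.

```python
import numpy as np


def epsilon_greedy(q_row, epsilon, rng):
    # Random action with probability epsilon, otherwise greedy.
    if rng.random() < epsilon:
        return int(rng.integers(len(q_row)))
    return int(np.argmax(q_row))


def differential_q_learning(env, num_states, num_actions,
                            alpha=0.1, eta=1.0, epsilon=0.1,
                            num_steps=80_000, seed=0):
    # Tabular Differential Q-learning: learns action values and an
    # average-reward estimate r_bar simultaneously, with no reference state.
    rng = np.random.default_rng(seed)
    q = np.zeros((num_states, num_actions))
    r_bar = 0.0
    s = env.reset()                    # assumed interface: returns a state index
    for _ in range(num_steps):
        a = epsilon_greedy(q[s], epsilon, rng)
        r, s_next = env.step(a)        # assumed interface: (reward, next state)
        delta = r - r_bar + np.max(q[s_next]) - q[s, a]
        q[s, a] += alpha * delta
        r_bar += eta * alpha * delta   # eta scales the reward-rate step size
        s = s_next
    return q, r_bar


def rvi_q_learning(env, num_states, num_actions, f,
                   alpha=0.1, epsilon=0.1, num_steps=80_000, seed=0):
    # Tabular RVI Q-learning: subtracts a reference function f(Q) instead of a
    # learned average-reward estimate. The three reference functions compared
    # in the paper correspond to, e.g.:
    #   f = lambda q: q[ref_s, ref_a]  # value of one reference state-action pair
    #   f = np.max                     # maximum of the action-value estimates
    #   f = np.mean                    # mean of the action-value estimates
    rng = np.random.default_rng(seed)
    q = np.zeros((num_states, num_actions))
    s = env.reset()
    for _ in range(num_steps):
        a = epsilon_greedy(q[s], epsilon, rng)
        r, s_next = env.step(a)
        delta = r - f(q) + np.max(q[s_next]) - q[s, a]
        q[s, a] += alpha * delta
        s = s_next
    return q
```

The contrast between the two functions mirrors the paper's argument: Differential Q-learning maintains its own reward-rate estimate, whereas RVI Q-learning requires the user to choose a reference function over states or state-action pairs.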