Learning and Planning in Average-Reward Markov Decision Processes

Authors: Yi Wan, Abhishek Naik, Richard S. Sutton

ICML 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In experiments with an Access-Control Queuing Task, we show some of the difficulties that can arise when using methods that rely on reference states and argue that our new algorithms can be significantly easier to use.
Researcher Affiliation | Collaboration | 1University of Alberta and Alberta Machine Intelligence Institute (Amii), Edmonton, Canada. 2DeepMind.
Pseudocode | Yes | Pseudocode for both algorithms is provided in Appendix A.
Open Source Code | No | The paper does not provide any links or explicit statements about the public availability of its source code.
Open Datasets | Yes | In this section we present empirical results with both Differential Q-learning and RVI Q-learning algorithms on the Access-Control Queuing task (Sutton & Barto, 2018). In this section we present empirical results with average-reward prediction learning algorithms using the Two Loop task shown in the upper right of Figure 3 (cf. Mahadevan, 1996; Naik et al., 2019).
Dataset Splits | No | The paper describes experiments run for a certain number of steps (e.g., "30 runs of 80,000 steps") and parameter studies, but does not specify explicit training/validation/test dataset splits as commonly found in supervised learning tasks. Data is generated through interaction with an environment.
Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., CPU, GPU models, memory) used for running the experiments.
Software Dependencies | No | The paper does not specify the version numbers of any software libraries, programming languages, or solvers used in the experiments.
Experiment Setup | Yes | Differential Q-learning was run with a range of η values, and RVI Q-learning was run with three kinds of reference functions suggested by Abounadi et al. (2001): (1) the value of a single reference state-action pair, for which we considered all possible 88 state-action pairs, (2) the maximum value of the action-value estimates, and (3) the mean of the action-value estimates. Both algorithms used an ϵ-greedy behavior policy with ϵ = 0.1. The policy π to be evaluated was the one that randomly picks left or right in state 0 with probability 0.5. Data was collected with a behavior policy that picks the left and right actions with probabilities 0.9 and 0.1, respectively.
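
For concreteness, below is a minimal tabular sketch of the two control algorithms referenced in the Experiment Setup row, following their standard published update rules (Differential Q-learning from Wan, Naik & Sutton, 2021; RVI Q-learning from Abounadi et al., 2001). The environment interface (env.reset(), env.step()), the default step sizes, the step count, and the ref_s/ref_a names in the comments are illustrative assumptions, not the authors' actual code.

```python
import numpy as np


def epsilon_greedy(q_row, epsilon, rng):
    # Random action with probability epsilon, otherwise greedy.
    if rng.random() < epsilon:
        return int(rng.integers(len(q_row)))
    return int(np.argmax(q_row))


def differential_q_learning(env, num_states, num_actions,
                            alpha=0.1, eta=1.0, epsilon=0.1,
                            num_steps=80_000, seed=0):
    # Tabular Differential Q-learning: learns action values and an
    # average-reward estimate r_bar simultaneously, with no reference state.
    rng = np.random.default_rng(seed)
    q = np.zeros((num_states, num_actions))
    r_bar = 0.0
    s = env.reset()                    # assumed interface: returns a state index
    for _ in range(num_steps):
        a = epsilon_greedy(q[s], epsilon, rng)
        r, s_next = env.step(a)        # assumed interface: (reward, next state)
        delta = r - r_bar + np.max(q[s_next]) - q[s, a]
        q[s, a] += alpha * delta
        r_bar += eta * alpha * delta   # eta scales the reward-rate step size
        s = s_next
    return q, r_bar


def rvi_q_learning(env, num_states, num_actions, f,
                   alpha=0.1, epsilon=0.1, num_steps=80_000, seed=0):
    # Tabular RVI Q-learning: subtracts a reference function f(Q) instead of a
    # learned average-reward estimate. The three reference functions compared
    # in the paper correspond to, e.g.:
    #   f = lambda q: q[ref_s, ref_a]  # value of one reference state-action pair
    #   f = np.max                     # maximum of the action-value estimates
    #   f = np.mean                    # mean of the action-value estimates
    rng = np.random.default_rng(seed)
    q = np.zeros((num_states, num_actions))
    s = env.reset()
    for _ in range(num_steps):
        a = epsilon_greedy(q[s], epsilon, rng)
        r, s_next = env.step(a)
        delta = r - f(q) + np.max(q[s_next]) - q[s, a]
        q[s, a] += alpha * delta
        s = s_next
    return q
```

The contrast between the two functions mirrors the paper's argument: Differential Q-learning maintains its own reward-rate estimate, whereas RVI Q-learning requires the user to choose a reference function over states or state-action pairs.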