Learning and Planning in Average-Reward Markov Decision Processes
Authors: Yi Wan, Abhishek Naik, Richard S. Sutton
ICML 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In experiments with an Access-Control Queuing Task, we show some of the difficulties that can arise when using methods that rely on reference states and argue that our new algorithms can be significantly easier to use. |
| Researcher Affiliation | Collaboration | ¹University of Alberta and Alberta Machine Intelligence Institute (Amii), Edmonton, Canada. ²DeepMind. |
| Pseudocode | Yes | Pseudocodes for both algorithms are in Appendix A. (A minimal sketch of the two update rules is given after this table.) |
| Open Source Code | No | The paper does not provide any links or explicit statements about the public availability of its source code. |
| Open Datasets | Yes | In this section we present empirical results with both Differential Q-learning and RVI Q-learning algorithms on the Access-Control Queuing task (Sutton & Barto 2018). In this section we present empirical results with average-reward prediction learning algorithms using the Two Loop task shown in the upper right of Figure 3 (cf. Mahadevan 1996, Naik et al. 2019). |
| Dataset Splits | No | The paper describes experiments run for a certain number of steps (e.g., "30 runs of 80,000 steps") and parameter studies, but does not specify explicit training/validation/test dataset splits as commonly found in supervised learning tasks. Data is generated through interaction with an environment. |
| Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., CPU, GPU models, memory) used for running the experiments. |
| Software Dependencies | No | The paper does not specify the version numbers of any software libraries, programming languages, or solvers used in the experiments. |
| Experiment Setup | Yes | Differential Q-learning was run with a range of η values, and RVI Q-learning was run with three kinds of reference functions suggested by Abounadi et al. (2001): (1) the value of a single reference state–action pair, for which we considered all possible 88 state–action pairs, (2) the maximum value of the action-value estimates, and (3) the mean of the action-value estimates. Both algorithms used an ϵ-greedy behavior policy with ϵ = 0.1. The policy π to be evaluated was the one that randomly picks left or right in state 0 with probability 0.5, with data collected by a behavior policy that picks the left and right actions with probabilities 0.9 and 0.1, respectively. (A code sketch of this setup follows the table.) |
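
As context for the Pseudocode and Experiment Setup rows, below is a minimal tabular sketch of the two control algorithms the paper evaluates, based on the update rules reported in the paper and in Abounadi et al. (2001). The NumPy array layout and the argument names (`alpha`, `eta`, `f`) are illustrative assumptions; this is not the authors' code (none is publicly available).

```python
import numpy as np

def differential_q_learning_step(Q, avg_reward, s, a, r, s_next, alpha, eta):
    """One tabular Differential Q-learning update (Wan, Naik & Sutton, 2021).

    Q          : (num_states, num_actions) array of action-value estimates
    avg_reward : scalar estimate of the average reward (R-bar)
    alpha, eta : step-size parameters; eta is the quantity swept in the paper's experiments
    Q is updated in place; the updated average-reward estimate is returned.
    """
    td_error = r - avg_reward + np.max(Q[s_next]) - Q[s, a]
    Q[s, a] += alpha * td_error
    return avg_reward + eta * alpha * td_error  # R-bar moves with the same TD error, scaled by eta


def rvi_q_learning_step(Q, s, a, r, s_next, alpha, f=np.max):
    """One tabular RVI Q-learning update (Abounadi et al., 2001).

    f is the reference function: np.max or np.mean of the estimates, or
    (lambda Q: Q[ref_s, ref_a]) for a single reference state-action pair,
    matching the three variants listed in the Experiment Setup row.
    """
    td_error = r - f(Q) + np.max(Q[s_next]) - Q[s, a]
    Q[s, a] += alpha * td_error
```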
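
The Experiment Setup row can likewise be read as a small run loop. The sketch below reuses `differential_q_learning_step` from the block above; the environment interface (`reset`/`step`), the step size, and the η grid are assumptions for illustration and are not specified by the paper. The 80,000-step run length and ϵ = 0.1 come from the quotes above, and 44 states × 2 actions corresponds to the 88 state–action pairs mentioned in the Experiment Setup row.

```python
import numpy as np

def epsilon_greedy(Q, s, rng, epsilon=0.1):
    """epsilon-greedy behavior policy with epsilon = 0.1, as reported for both algorithms."""
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))   # explore: uniform random action
    return int(np.argmax(Q[s]))                # exploit: greedy action


def run_differential_q_learning(env, num_states, num_actions,
                                steps=80_000, alpha=0.1, eta=1.0, seed=0):
    """One run of Differential Q-learning with an epsilon-greedy behavior policy.

    `env` is assumed to expose reset() -> state and step(action) -> (reward, next_state);
    the paper does not specify an implementation or interface.
    """
    rng = np.random.default_rng(seed)
    Q = np.zeros((num_states, num_actions))
    avg_reward = 0.0
    s = env.reset()
    for _ in range(steps):
        a = epsilon_greedy(Q, s, rng)
        r, s_next = env.step(a)
        avg_reward = differential_q_learning_step(Q, avg_reward, s, a, r, s_next, alpha, eta)
        s = s_next
    return Q, avg_reward


# Hypothetical parameter study over eta on an Access-Control Queuing environment
# (44 states x 2 actions = 88 state-action pairs); the grid below is illustrative only:
# for eta in [2.0 ** -k for k in range(8)]:
#     Q, r_bar = run_differential_q_learning(access_control_env, 44, 2, eta=eta)
```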