Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Learning and Planning in Average-Reward Markov Decision Processes
Authors: Yi Wan, Abhishek Naik, Richard S. Sutton
ICML 2021 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In experiments with an Access-Control Queuing Task, we show some of the difficulties that can arise when using methods that rely on reference states and argue that our new algorithms can be significantly easier to use. |
| Researcher Affiliation | Collaboration | 1University of Alberta and Alberta Machine Intelligence Institute (Amii), Edmonton, Canada. 2DeepMind. |
| Pseudocode | Yes | Pseudocode for both algorithms is in Appendix A. |
| Open Source Code | No | The paper does not provide any links or explicit statements about the public availability of its source code. |
| Open Datasets | Yes | In this section we present empirical results with both Differential Q-learning and RVI Q-learning algorithms on the Access-Control Queuing task (Sutton & Barto, 2018). … In this section we present empirical results with average-reward prediction learning algorithms using the Two Loop task shown in the upper right of Figure 3 (cf. Mahadevan, 1996; Naik et al., 2019). |
| Dataset Splits | No | The paper describes experiments run for a certain number of steps (e.g., "30 runs of 80,000 steps") and parameter studies, but does not specify explicit training/validation/test dataset splits as commonly found in supervised learning tasks. Data is generated through interaction with an environment. |
| Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., CPU, GPU models, memory) used for running the experiments. |
| Software Dependencies | No | The paper does not specify the version numbers of any software libraries, programming languages, or solvers used in the experiments. |
| Experiment Setup | Yes | Differential Q-learning was run with a range of η values, and RVI Q-learning was run with three kinds of reference functions suggested by Abounadi et al. (2001): (1) the value of a single reference state-action pair, for which all 88 possible state-action pairs were considered, (2) the maximum of the action-value estimates, and (3) the mean of the action-value estimates. Both algorithms used an ε-greedy behavior policy with ε = 0.1. The policy π to be evaluated picks left or right in state 0 with probability 0.5 each; data was collected with a behavior policy that picks the left and right actions with probabilities 0.9 and 0.1, respectively. |
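To make the role of the η parameter in the setup above concrete, here is a minimal sketch of a single tabular Differential Q-learning update in the spirit of Wan et al. (2021): the TD error subtracts a learned average-reward estimate R̄ instead of discounting, and η scales the step size of the R̄ update relative to the action-value step size α. The function name, the toy table sizes, and the specific α/η values are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def differential_q_update(Q, r_bar, s, a, r, s_next, alpha=0.1, eta=1.0):
    """One tabular Differential Q-learning update (sketch).

    delta is the average-reward TD error; both the action-value
    table Q and the average-reward estimate r_bar are updated with
    the same delta, with eta scaling r_bar's effective step size.
    """
    delta = r - r_bar + np.max(Q[s_next]) - Q[s, a]
    Q[s, a] += alpha * delta
    r_bar += eta * alpha * delta  # average-reward estimate update
    return Q, r_bar

# Illustrative single step on a toy 2-state, 2-action table
Q = np.zeros((2, 2))
Q, r_bar = differential_q_update(Q, r_bar=0.0, s=0, a=1, r=1.0, s_next=1)
print(Q[0, 1], r_bar)  # delta = 1.0, so both become 0.1
```

Because the update is free of reference states, sweeping η (as in the paper's parameter study) only rescales how quickly R̄ tracks the reward rate, which is the ease-of-use property the authors contrast with RVI Q-learning's reference functions.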