Efficient Average Reward Reinforcement Learning Using Constant Shifting Values
Authors: Shangdong Yang, Yang Gao, Bo An, Hao Wang, Xingguo Chen
AAAI 2016
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on representative MDPs and the popular game Tetris show that the proposed algorithms significantly outperform the state-of-the-art ones. In this section, the proposed CSV-LEARNING algorithms are evaluated in two average reward MDPs used by Schwartz and Osband (Schwartz 1993; Osband, Russo, and van Roy 2013) (Figure 1(a) & 1(b)), as well as Tetris (Figure 1(c)). |
| Researcher Affiliation | Academia | State Key Laboratory for Novel Software Technology, Collaborative Innovation Center of Novel Software Technology and Industrialization, Nanjing University, Nanjing 210023, China; School of Computer Engineering, Nanyang Technological University, 639798, Singapore; School of Computer Science and Technology, School of Software, Nanjing University of Posts and Telecommunications, Nanjing 210023, China |
| Pseudocode | Yes | Algorithm 1 presents CSV-LEARNING. Algorithm 2 describes our extension of CSV-LEARNING using linear FA. |
| Open Source Code | No | The paper does not provide an explicit statement about releasing the source code for the described methodology or a link to a code repository. |
| Open Datasets | Yes | In this section, the proposed CSV-LEARNING algorithms are evaluated in two average reward MDPs used by Schwartz and Osband (Schwartz 1993; Osband, Russo, and van Roy 2013) (Figure 1(a) & 1(b)), as well as Tetris (Figure 1(c)). |
| Dataset Splits | No | The paper does not provide specific dataset split information (exact percentages, sample counts, citations to predefined splits, or detailed splitting methodology) needed to reproduce the data partitioning into train, validation, and test sets. |
| Hardware Specification | No | The paper does not provide specific hardware details (exact GPU/CPU models, processor types, or memory amounts) used for running its experiments. |
| Software Dependencies | No | The paper mentions algorithms like TD(λ), GTD, and TDC, but does not provide specific software names with version numbers or library dependencies used for the experiments. |
| Experiment Setup | Yes | In MDP 1(a), the learning rates α and β of all model-free algorithms were both 0.1 and exploration was executed by a fixed ϵ-greedy policy with ϵ = 0.1. The reference of RVI Q-learning was State 1. In our method, the CSV was set to 4. In MDP 1(b), the following settings worked the best: learning rates α = β = 0.01, and the reference of RVI Q-learning was the starting state. The CSV was set to 0.2 in our method. In both MDPs, we implemented UCRL2 with δ = 0.05. We used the learning rate α = 0.1 for our method, α = 0.1 and β = 0.1 for R-learning, α = 0.5 and λ = 0.1 for TD(λ), α = 0.1 and β = 0.6 for TDC, and α = 0.1 and β = 0.9 for GTD. The parameters above all performed the best via tuning. |
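The Pseudocode and Experiment Setup rows above name Algorithm 1 (CSV-LEARNING) and report the MDP 1(a) settings (learning rates of 0.1, a fixed ε-greedy policy with ε = 0.1, and a CSV of 4), but do not reproduce the update rule itself. Below is a minimal, hypothetical Python sketch that assumes a Q-learning-style rule in which the constant shifting value is subtracted from the immediate reward; the exact form of Algorithm 1 may differ, and the toy environment is only a placeholder to make the snippet runnable, not the MDP of Figure 1(a).

```python
import random
from collections import defaultdict

def csv_q_update(Q, s, a, r, s_next, actions, alpha, csv):
    """One tabular update in the assumed CSV style: subtract the constant
    shifting value from the reward, then bootstrap as in Q-learning.
    This is a sketch; see Algorithm 1 in the paper for the exact rule."""
    target = (r - csv) + max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])

def epsilon_greedy(Q, s, actions, epsilon):
    """Fixed epsilon-greedy exploration, as reported for MDP 1(a)."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda b: Q[(s, b)])

class ToyMDP:
    """Placeholder two-state MDP used only so the loop below runs;
    it is NOT the MDP of Figure 1(a)."""
    def __init__(self):
        self.state = 0
    def reset(self):
        self.state = 0
        return self.state
    def step(self, action):
        # Action 1 toggles between the two states; action 0 stays put.
        if action == 1:
            self.state = 1 - self.state
        reward = 5.0 if self.state == 1 else 1.0
        return self.state, reward

# Reported settings for MDP 1(a): alpha = 0.1, epsilon = 0.1, CSV = 4.
env, Q, actions = ToyMDP(), defaultdict(float), [0, 1]
s = env.reset()
for _ in range(10000):
    a = epsilon_greedy(Q, s, actions, epsilon=0.1)
    s_next, r = env.step(a)
    csv_q_update(Q, s, a, r, s_next, actions, alpha=0.1, csv=4.0)
    s = s_next
```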
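The Pseudocode row also mentions Algorithm 2, an extension of CSV-LEARNING with linear function approximation, and the Experiment Setup row quotes only learning rates for it (e.g., α = 0.1 for the proposed method; the β rates apply to the baselines R-learning, TDC, and GTD). The following is a hedged semi-gradient sketch under the same "reward minus CSV" assumption; the paper's Algorithm 2 may include further correction terms not shown here.

```python
import numpy as np

def csv_linear_fa_update(theta, phi, phi_next, r, alpha, csv):
    """One semi-gradient update with linear function approximation and a
    constant shifting value subtracted from the reward. A sketch only;
    Algorithm 2 in the paper may differ."""
    delta = (r - csv) + theta @ phi_next - theta @ phi  # TD error with CSV
    return theta + alpha * delta * phi

# Purely illustrative call with random features (not Tetris features).
rng = np.random.default_rng(0)
theta = np.zeros(8)
phi, phi_next = rng.random(8), rng.random(8)
theta = csv_linear_fa_update(theta, phi, phi_next, r=1.0, alpha=0.1, csv=0.2)
```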