Efficient Average Reward Reinforcement Learning Using Constant Shifting Values
Authors: Shangdong Yang, Yang Gao, Bo An, Hao Wang, Xingguo Chen
AAAI 2016
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on representative MDPs and the popular game Tetris show that the proposed algorithms significantly outperform the state-of-the-art ones. In this section, the proposed CSV-LEARNING algorithms are evaluated in two average reward MDPs used by Schwartz and Osband (Schwartz 1993; Osband, Russo, and van Roy 2013) (Figure 1(a) & 1(b)), as well as Tetris (Figure 1(c)). |
| Researcher Affiliation | Academia | State Key Laboratory for Novel Software Technology, Collaborative Innovation Center of Novel Software Technology and Industrialization, Nanjing University, Nanjing 210023, China; School of Computer Engineering, Nanyang Technological University, 639798, Singapore; School of Computer Science and Technology, School of Software, Nanjing University of Posts and Telecommunications, Nanjing 210023, China |
| Pseudocode | Yes | Algorithm 1 presents CSV-LEARNING. Algorithm 2 describes our extension of CSV-LEARNING using linear FA. |
| Open Source Code | No | The paper does not provide an explicit statement about releasing the source code for the described methodology or a link to a code repository. |
| Open Datasets | Yes | In this section, the proposed CSV-LEARNING algorithms are evaluated in two average reward MDPs used by Schwartz and Osband (Schwartz 1993; Osband, Russo, and van Roy 2013) (Figure 1(a) & 1(b)), as well as Tetris (Figure 1(c)). |
| Dataset Splits | No | The paper does not provide specific dataset split information (exact percentages, sample counts, citations to predefined splits, or detailed splitting methodology) needed to reproduce the data partitioning into train, validation, and test sets. |
| Hardware Specification | No | The paper does not provide specific hardware details (exact GPU/CPU models, processor types, or memory amounts) used for running its experiments. |
| Software Dependencies | No | The paper mentions algorithms like TD(λ), GTD, and TDC, but does not provide specific software names with version numbers or library dependencies used for the experiments. |
| Experiment Setup | Yes | In MDP 1(a), the learning rates α and β of all model-free algorithms were both 0.1 and exploration was executed by a fixed ϵ-greedy policy with ϵ = 0.1. The reference of RVI Q-learning was State 1. In our method, the CSV was set to 4. In MDP 1(b), the following settings worked the best: learning rates α = β = 0.01, and the reference of RVI Q-learning was the starting state. The CSV was set to 0.2 in our method. In both MDPs, we implemented UCRL2 with δ = 0.05. We used the learning rate α = 0.1 for our method, α = 0.1 and β = 0.1 for R-learning, α = 0.5 and λ = 0.1 for TD(λ), α = 0.1 and β = 0.6 for TDC, and α = 0.1 and β = 0.9 for GTD. The parameters above all performed the best via tuning. |
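The Pseudocode and Experiment Setup rows above name Algorithm 1 (CSV-LEARNING) and report the MDP 1(a) settings (learning rates of 0.1, a fixed ε-greedy policy with ε = 0.1, and a CSV of 4), but do not reproduce the update rule itself. Below is a minimal, hypothetical Python sketch that assumes a Q-learning-style rule in which the constant shifting value is subtracted from the immediate reward; the exact form of Algorithm 1 may differ, and the toy environment is only a placeholder to make the snippet runnable, not the MDP of Figure 1(a).

```python
import random
from collections import defaultdict

def csv_q_update(Q, s, a, r, s_next, actions, alpha, csv):
    """One tabular update in the assumed CSV style: subtract the constant
    shifting value from the reward, then bootstrap as in Q-learning.
    This is a sketch; see Algorithm 1 in the paper for the exact rule."""
    target = (r - csv) + max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])

def epsilon_greedy(Q, s, actions, epsilon):
    """Fixed epsilon-greedy exploration, as reported for MDP 1(a)."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda b: Q[(s, b)])

class ToyMDP:
    """Placeholder two-state MDP used only so the loop below runs;
    it is NOT the MDP of Figure 1(a)."""
    def __init__(self):
        self.state = 0
    def reset(self):
        self.state = 0
        return self.state
    def step(self, action):
        # Action 1 toggles between the two states; action 0 stays put.
        if action == 1:
            self.state = 1 - self.state
        reward = 5.0 if self.state == 1 else 1.0
        return self.state, reward

# Reported settings for MDP 1(a): alpha = 0.1, epsilon = 0.1, CSV = 4.
env, Q, actions = ToyMDP(), defaultdict(float), [0, 1]
s = env.reset()
for _ in range(10000):
    a = epsilon_greedy(Q, s, actions, epsilon=0.1)
    s_next, r = env.step(a)
    csv_q_update(Q, s, a, r, s_next, actions, alpha=0.1, csv=4.0)
    s = s_next
```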
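The Pseudocode row also mentions Algorithm 2, an extension of CSV-LEARNING with linear function approximation, and the Experiment Setup row quotes only learning rates for it (e.g., α = 0.1 for the proposed method; the β rates apply to the baselines R-learning, TDC, and GTD). The following is a hedged semi-gradient sketch under the same "reward minus CSV" assumption; the paper's Algorithm 2 may include further correction terms not shown here.

```python
import numpy as np

def csv_linear_fa_update(theta, phi, phi_next, r, alpha, csv):
    """One semi-gradient update with linear function approximation and a
    constant shifting value subtracted from the reward. A sketch only;
    Algorithm 2 in the paper may differ."""
    delta = (r - csv) + theta @ phi_next - theta @ phi  # TD error with CSV
    return theta + alpha * delta * phi

# Purely illustrative call with random features (not Tetris features).
rng = np.random.default_rng(0)
theta = np.zeros(8)
phi, phi_next = rng.random(8), rng.random(8)
theta = csv_linear_fa_update(theta, phi, phi_next, r=1.0, alpha=0.1, csv=0.2)
```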