Policy Tree: Adaptive Representation for Policy Gradient

Authors: Ujjwal Das Gupta, Erik Talvitie, Michael Bowling

AAAI 2015 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments show that this algorithm can choose genuinely helpful splits and significantly improve upon the commonly used linear Gibbs softmax policy, which we choose as our base policy.
Researcher Affiliation | Academia | Ujjwal Das Gupta, Department of Computing Science, University of Alberta, ujjwal@ualberta.ca; Erik Talvitie, Department of Computer Science, Franklin & Marshall College, erik.talvitie@fandm.edu; Michael Bowling, Department of Computing Science, University of Alberta, mbowling@ualberta.ca
Pseudocode | No | The paper provides a 'high level procedure' as a bulleted list, but it is not formatted as pseudocode or a clearly labeled algorithm block.
Open Source Code | No | The paper does not provide any statement or link indicating that the source code for the described methodology is publicly available.
Open Datasets | No | Our test suite is a set of 4 simple games inspired by arcade games, which have a 16x16 pixel game screen with 4 colours.
Dataset Splits | No | For Policy Tree, we used a parameter optimization stage of 49000 episodes, and a tree growth stage of 1000 episodes. Splits are therefore made after every 50000 episodes. For each algorithm, we performed 10 different runs of learning over 500000 episodes.
Hardware Specification | No | The paper does not provide any specific details about the hardware used for running its experiments.
Software Dependencies | No | We used the GPOMDP/Policy Gradient algorithm with optimal baseline (Peters and Schaal 2006) to measure the policy gradient in all of the algorithms.
Experiment Setup | Yes | We use the linear Gibbs softmax policy as the base policy for our algorithm, augmented with ϵ-greedy exploration... The value of ϵ used was 0.001. For the tree growth criterion, we choose the q = 1 norm of the gradient. A stopping criterion parameter of λ = 1 was used. For each algorithm, we performed 10 different runs of learning over 500000 episodes.
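
To make the Experiment Setup row above concrete, here is a minimal sketch of a linear Gibbs softmax policy augmented with ϵ-greedy exploration, the base policy the paper describes. Only ϵ = 0.001 is taken from the paper; the feature representation, array shapes, and function names are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def gibbs_softmax_probs(theta, phi):
    """Linear Gibbs softmax policy: P(a | s) is proportional to exp(theta[a] . phi(s)).

    theta : (num_actions, num_features) weight matrix  -- assumed layout
    phi   : (num_features,) feature vector for the current state
    """
    prefs = theta @ phi                  # one preference value per action
    prefs -= prefs.max()                 # subtract the max for numerical stability
    exp_prefs = np.exp(prefs)
    return exp_prefs / exp_prefs.sum()

def select_action(theta, phi, epsilon=0.001, rng=None):
    """With probability epsilon pick a uniformly random action, otherwise
    sample from the softmax distribution (epsilon = 0.001 in the paper)."""
    rng = rng or np.random.default_rng()
    num_actions = theta.shape[0]
    if rng.random() < epsilon:
        return int(rng.integers(num_actions))
    return int(rng.choice(num_actions, p=gibbs_softmax_probs(theta, phi)))
```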
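The Software Dependencies row notes that all algorithms estimate the policy gradient with GPOMDP and an optimal baseline (Peters and Schaal 2006). The sketch below shows the general shape of a GPOMDP-style Monte Carlo gradient estimate; for simplicity it uses a constant mean-reward baseline rather than the per-parameter optimal baseline of that reference, assumes an undiscounted episodic setting and a made-up trajectory layout, and reuses gibbs_softmax_probs from the previous sketch.

```python
import numpy as np

def grad_log_softmax(theta, phi, action):
    """Score function of the linear Gibbs softmax policy:
    d/dtheta log pi(action | s) = (e_action - pi(. | s)) outer phi(s)."""
    probs = gibbs_softmax_probs(theta, phi)
    indicator = np.zeros_like(probs)
    indicator[action] = 1.0
    return np.outer(indicator - probs, phi)

def gpomdp_gradient_estimate(theta, episodes):
    """GPOMDP-style gradient estimate over a batch of episodes.

    episodes : list of trajectories; each trajectory is a list of
               (phi, action, reward) steps  -- an assumed data layout.
    A constant mean-reward baseline stands in for the optimal baseline.
    """
    baseline = np.mean([r for traj in episodes for _, _, r in traj])
    grad = np.zeros_like(theta)
    for traj in episodes:
        score_so_far = np.zeros_like(theta)        # sum of score functions up to time t
        for phi, action, reward in traj:
            score_so_far += grad_log_softmax(theta, phi, action)
            grad += score_so_far * (reward - baseline)
    return grad / len(episodes)
```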
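The quote in the Dataset Splits row describes the training schedule (tree splits, not data splits): a 49000-episode parameter optimization stage alternates with a 1000-episode tree growth stage, so the representation can be split every 50000 episodes, over 500000 episodes per run and 10 runs per algorithm. A hedged sketch of that outer loop follows; the episode numbers come from the paper, while run_episode_and_update and consider_growing_tree are hypothetical callables standing in for the learning step and the paper's split procedure.

```python
# Training schedule implied by the quoted numbers (assumed structure).
PARAM_OPT_EPISODES = 49_000    # parameter optimization stage
TREE_GROWTH_EPISODES = 1_000   # tree growth stage (a split may occur every 50,000 episodes)
TOTAL_EPISODES = 500_000       # episodes per run
NUM_RUNS = 10                  # independent runs per algorithm

def train_one_run(policy_tree, env, run_episode_and_update, consider_growing_tree):
    """Alternate parameter optimization and tree growth until the episode budget is spent.

    run_episode_and_update and consider_growing_tree are hypothetical callables
    supplied by the caller; they are not defined in the paper's text."""
    episode = 0
    while episode < TOTAL_EPISODES:
        # Stage 1: optimize the parameters of the current tree-structured policy.
        for _ in range(PARAM_OPT_EPISODES):
            run_episode_and_update(policy_tree, env)
            episode += 1
        # Stage 2: gather statistics over a short stage, then possibly split a leaf.
        for _ in range(TREE_GROWTH_EPISODES):
            run_episode_and_update(policy_tree, env)
            episode += 1
        consider_growing_tree(policy_tree)
```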