Policy Tree: Adaptive Representation for Policy Gradient

Authors: Ujjwal Das Gupta, Erik Talvitie, Michael Bowling

AAAI 2015 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments show that this algorithm can choose genuinely helpful splits and significantly improve upon the commonly used linear Gibbs softmax policy, which we choose as our base policy.
Researcher Affiliation | Academia | Ujjwal Das Gupta, Department of Computing Science, University of Alberta, ujjwal@ualberta.ca; Erik Talvitie, Department of Computer Science, Franklin & Marshall College, erik.talvitie@fandm.edu; Michael Bowling, Department of Computing Science, University of Alberta, mbowling@ualberta.ca
Pseudocode | No | The paper provides a 'high level procedure' as a bulleted list, but it is not formatted as pseudocode or a clearly labeled algorithm block.
Open Source Code | No | The paper does not provide any statement or link indicating that the source code for the described methodology is publicly available.
Open Datasets | No | Our test suite is a set of 4 simple games inspired by arcade games, which have a 16x16 pixel game screen with 4 colours.
Dataset Splits | No | For Policy Tree, we used a parameter optimization stage of 49000 episodes, and a tree growth stage of 1000 episodes. Splits are therefore made after every 50000 episodes. For each algorithm, we performed 10 different runs of learning over 500000 episodes.
Hardware Specification | No | The paper does not provide any specific details about the hardware used for running its experiments.
Software Dependencies | No | We used the GPOMDP/Policy Gradient algorithm with optimal baseline (Peters and Schaal 2006) to measure the policy gradient in all of the algorithms.
Experiment Setup | Yes | We use the linear Gibbs softmax policy as the base policy for our algorithm, augmented with ϵ-greedy exploration... The value of ϵ used was 0.001. For the tree growth criterion, we choose the q = 1 norm of the gradient. A stopping criterion parameter of λ = 1 was used. For each algorithm, we performed 10 different runs of learning over 500000 episodes.
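
To make the Experiment Setup row above concrete, here is a minimal sketch of a linear Gibbs softmax policy augmented with ϵ-greedy exploration, the base policy the paper describes. Only ϵ = 0.001 is taken from the paper; the feature representation, array shapes, and function names are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def gibbs_softmax_probs(theta, phi):
    """Linear Gibbs softmax policy: P(a | s) is proportional to exp(theta[a] . phi(s)).

    theta : (num_actions, num_features) weight matrix  -- assumed layout
    phi   : (num_features,) feature vector for the current state
    """
    prefs = theta @ phi                  # one preference value per action
    prefs -= prefs.max()                 # subtract the max for numerical stability
    exp_prefs = np.exp(prefs)
    return exp_prefs / exp_prefs.sum()

def select_action(theta, phi, epsilon=0.001, rng=None):
    """With probability epsilon pick a uniformly random action, otherwise
    sample from the softmax distribution (epsilon = 0.001 in the paper)."""
    rng = rng or np.random.default_rng()
    num_actions = theta.shape[0]
    if rng.random() < epsilon:
        return int(rng.integers(num_actions))
    return int(rng.choice(num_actions, p=gibbs_softmax_probs(theta, phi)))
```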
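The Software Dependencies row notes that all algorithms estimate the policy gradient with GPOMDP and an optimal baseline (Peters and Schaal 2006). The sketch below shows the general shape of a GPOMDP-style Monte Carlo gradient estimate; for simplicity it uses a constant mean-reward baseline rather than the per-parameter optimal baseline of that reference, assumes an undiscounted episodic setting and a made-up trajectory layout, and reuses gibbs_softmax_probs from the previous sketch.

```python
import numpy as np

def grad_log_softmax(theta, phi, action):
    """Score function of the linear Gibbs softmax policy:
    d/dtheta log pi(action | s) = (e_action - pi(. | s)) outer phi(s)."""
    probs = gibbs_softmax_probs(theta, phi)
    indicator = np.zeros_like(probs)
    indicator[action] = 1.0
    return np.outer(indicator - probs, phi)

def gpomdp_gradient_estimate(theta, episodes):
    """GPOMDP-style gradient estimate over a batch of episodes.

    episodes : list of trajectories; each trajectory is a list of
               (phi, action, reward) steps  -- an assumed data layout.
    A constant mean-reward baseline stands in for the optimal baseline.
    """
    baseline = np.mean([r for traj in episodes for _, _, r in traj])
    grad = np.zeros_like(theta)
    for traj in episodes:
        score_so_far = np.zeros_like(theta)        # sum of score functions up to time t
        for phi, action, reward in traj:
            score_so_far += grad_log_softmax(theta, phi, action)
            grad += score_so_far * (reward - baseline)
    return grad / len(episodes)
```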
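The quote in the Dataset Splits row describes the training schedule (tree splits, not data splits): a 49000-episode parameter optimization stage alternates with a 1000-episode tree growth stage, so the representation can be split every 50000 episodes, over 500000 episodes per run and 10 runs per algorithm. A hedged sketch of that outer loop follows; the episode numbers come from the paper, while run_episode_and_update and consider_growing_tree are hypothetical callables standing in for the learning step and the paper's split procedure.

```python
# Training schedule implied by the quoted numbers (assumed structure).
PARAM_OPT_EPISODES = 49_000    # parameter optimization stage
TREE_GROWTH_EPISODES = 1_000   # tree growth stage (a split may occur every 50,000 episodes)
TOTAL_EPISODES = 500_000       # episodes per run
NUM_RUNS = 10                  # independent runs per algorithm

def train_one_run(policy_tree, env, run_episode_and_update, consider_growing_tree):
    """Alternate parameter optimization and tree growth until the episode budget is spent.

    run_episode_and_update and consider_growing_tree are hypothetical callables
    supplied by the caller; they are not defined in the paper's text."""
    episode = 0
    while episode < TOTAL_EPISODES:
        # Stage 1: optimize the parameters of the current tree-structured policy.
        for _ in range(PARAM_OPT_EPISODES):
            run_episode_and_update(policy_tree, env)
            episode += 1
        # Stage 2: gather statistics over a short stage, then possibly split a leaf.
        for _ in range(TREE_GROWTH_EPISODES):
            run_episode_and_update(policy_tree, env)
            episode += 1
        consider_growing_tree(policy_tree)
```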