Policy Tree: Adaptive Representation for Policy Gradient
Authors: Ujjwal Das Gupta, Erik Talvitie, Michael Bowling
AAAI 2015
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments show that this algorithm can choose genuinely helpful splits and significantly improve upon the commonly used linear Gibbs softmax policy, which we choose as our base policy. |
| Researcher Affiliation | Academia | Ujjwal Das Gupta, Department of Computing Science, University of Alberta (ujjwal@ualberta.ca); Erik Talvitie, Department of Computer Science, Franklin & Marshall College (erik.talvitie@fandm.edu); Michael Bowling, Department of Computing Science, University of Alberta (mbowling@ualberta.ca) |
| Pseudocode | No | The paper provides a 'high level procedure' as a bulleted list, but it is not formatted as pseudocode or a clearly labeled algorithm block. |
| Open Source Code | No | The paper does not provide any statement or link indicating that the source code for the described methodology is publicly available. |
| Open Datasets | No | The evaluation uses a custom test suite rather than a publicly available dataset: 'Our test suite is a set of 4 simple games inspired by arcade games, which have a 16x16 pixel game screen with 4 colours.' |
| Dataset Splits | No | The paper describes a training schedule and tree splits, not train/validation/test dataset splits: 'For Policy Tree, we used a parameter optimization stage of 49000 episodes, and a tree growth stage of 1000 episodes. Splits are therefore made after every 50000 episodes. For each algorithm, we performed 10 different runs of learning over 500000 episodes.' |
| Hardware Specification | No | The paper does not provide any specific details about the hardware used for running its experiments. |
| Software Dependencies | No | The paper cites the algorithms it builds on but does not list software packages or versions: 'We used the GPOMDP/Policy Gradient algorithm with optimal baseline (Peters and Schaal 2006) to measure the policy gradient in all of the algorithms.' A hedged sketch of such a gradient estimator follows the table. |
| Experiment Setup | Yes | We use the linear Gibbs softmax policy as the base policy for our algorithm, augmented with ϵ-greedy exploration... The value of ϵ used was 0.001. For the tree growth criterion, we choose the q = 1 norm of the gradient. A stopping criterion parameter of λ = 1 was used. For each algorithm, we performed 10 different runs of learning over 500000 episodes. (A hedged sketch of this base policy appears below the table.) |
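
The Experiment Setup row quotes a linear Gibbs softmax base policy augmented with ϵ-greedy exploration (ϵ = 0.001) over 16x16, 4-colour game screens. The sketch below illustrates one plausible reading of that setup; the feature encoding, the action count, and the ϵ-mixing form of the exploration are assumptions for illustration, not details taken from the paper.

```python
import numpy as np

# Hedged sketch of a linear Gibbs softmax policy with epsilon-greedy exploration.
# The feature encoding (one-hot colour per pixel), the action count, and the
# epsilon-mixing form of exploration are assumptions; only the screen size,
# colour count, and epsilon = 0.001 come from the quoted setup.

N_PIXELS = 16 * 16
N_COLOURS = 4
N_FEATURES = N_PIXELS * N_COLOURS   # assumed one-hot colour indicator per pixel
N_ACTIONS = 3                       # assumed number of actions
EPSILON = 0.001                     # value reported in the paper

def features(screen):
    """One-hot encode a 16x16 array of colour indices (assumed encoding)."""
    phi = np.zeros(N_FEATURES)
    phi[np.arange(N_PIXELS) * N_COLOURS + screen.ravel()] = 1.0
    return phi

def gibbs_softmax_probs(theta, phi):
    """Linear Gibbs softmax: pi(a|s) proportional to exp(theta_a . phi(s))."""
    logits = theta @ phi            # theta has shape (N_ACTIONS, N_FEATURES)
    logits -= logits.max()          # subtract max for numerical stability
    p = np.exp(logits)
    return p / p.sum()

def act(theta, screen, rng):
    """Sample an action, mixing in uniform exploration with probability epsilon."""
    p = gibbs_softmax_probs(theta, features(screen))
    p = (1.0 - EPSILON) * p + EPSILON / N_ACTIONS
    return rng.choice(N_ACTIONS, p=p)

rng = np.random.default_rng(0)
theta = np.zeros((N_ACTIONS, N_FEATURES))
screen = rng.integers(0, N_COLOURS, size=(16, 16))
print(act(theta, screen, rng))
```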
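
The Software Dependencies row notes that gradients are measured with the GPOMDP policy gradient algorithm with an optimal baseline (Peters and Schaal 2006), but no software package is named. Below is a minimal sketch of a GPOMDP-style estimator with a per-timestep, component-wise variance-minimising baseline; the trajectory format and the `grad_log_pi` helper are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

# Hedged sketch of a GPOMDP-style policy gradient estimate with a per-timestep,
# component-wise optimal (variance-minimising) baseline, in the spirit of
# Peters and Schaal (2006). The trajectory format and grad_log_pi helper are
# assumptions for illustration.

def gpomdp_gradient(trajectories, grad_log_pi, horizon, n_params):
    """trajectories: list of episodes, each a list of (state, action, reward)."""
    n = len(trajectories)
    g = np.zeros((n, horizon, n_params))   # cumulative score: sum_{k<=t} grad log pi(a_k|s_k)
    r = np.zeros((n, horizon))             # per-timestep rewards (zero-padded)
    for i, episode in enumerate(trajectories):
        cum = np.zeros(n_params)
        for t, (s, a, reward) in enumerate(episode):
            cum = cum + grad_log_pi(s, a)
            g[i, t] = cum
            r[i, t] = reward
    # Optimal baseline per timestep and parameter component:
    #   b_t = E[g_t^2 * r_t] / E[g_t^2]
    g2 = g ** 2
    baseline = (g2 * r[:, :, None]).mean(axis=0) / (g2.mean(axis=0) + 1e-12)
    # Gradient estimate: average over episodes of sum_t g_t * (r_t - b_t)
    return (g * (r[:, :, None] - baseline[None])).sum(axis=1).mean(axis=0)
```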