Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et alK. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026..
Policy Tree: Adaptive Representation for Policy Gradient
Authors: Ujjwal Das Gupta, Erik Talvitie, Michael Bowling
AAAI 2015 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments show that this algorithm can choose genuinely helpful splits and significantly improve upon the commonly used linear Gibbs softmax policy, which we choose as our base policy. |
| Researcher Affiliation | Academia | Ujjwal Das Gupta Department of Computing Science University of Alberta EMAIL Erik Talvitie Department of Computer Science Franklin & Marshall College EMAIL Michael Bowling Department of Computing Science University of Alberta EMAIL |
| Pseudocode | No | The paper provides a 'high level procedure' as a bulleted list, but it is not formatted as pseudocode or a clearly labeled algorithm block. |
| Open Source Code | No | The paper does not provide any statement or link indicating that the source code for the described methodology is publicly available. |
| Open Datasets | No | Our test suite is a set of 4 simple games inspired by arcade games, which have a 16x16 pixel game screen with 4 colours. |
| Dataset Splits | No | For Policy Tree, we used a parameter optimization stage of 49000 episodes, and a tree growth stage of 1000 episodes. Splits are therefore made after every 50000 episodes. For each algorithm, we performed 10 different runs of learning over 500000 episodes. |
| Hardware Specification | No | The paper does not provide any specific details about the hardware used for running its experiments. |
| Software Dependencies | No | We used the GPOMDP/Policy Gradient algorithm with optimal baseline (Peters and Schaal 2006) to measure the policy gradient in all of the algorithms. |
| Experiment Setup | Yes | We use the linear Gibbs softmax policy as the base policy for our algorithm, augmented with ϵ-greedy exploration... The value of ϵ used was 0.001. For the tree growth criterion, we choose the q = 1 norm of the gradient. A stopping criterion parameter of λ = 1 was used. For each algorithm, we performed 10 different runs of learning over 500000 episodes. |