Reinforcement Learning with Parameterized Actions

Authors: Warwick Masson, Pravesh Ranchod, George Konidaris

AAAI 2016 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We introduce the Q-PAMDP algorithm for learning in these domains, show that it converges to a local optimum, and compare it to direct policy search in the goal-scoring and Platform domains.
Researcher Affiliation | Academia | Warwick Masson and Pravesh Ranchod, School of Computer Science and Applied Mathematics, University of the Witwatersrand, Johannesburg, South Africa (warwick.masson@students.wits.ac.za, pravesh.ranchod@wits.ac.za); George Konidaris, Department of Computer Science, Duke University, Durham, North Carolina 27708 (gdk@cs.duke.edu)
Pseudocode | Yes | Algorithm 1 Q-PAMDP(k) (a minimal sketch of this alternation is given after the table)
Open Source Code | No | No explicit statement or link regarding the public availability of source code for the described methodology was found.
Open Datasets | No | The paper describes experiments in the 'goal-scoring' and 'Platform' domains, which appear to be simulation environments set up by the authors rather than pre-existing public datasets with explicit access information. It references 'Kitano et al. 1997' for the robot soccer problem, but this is a problem description, not a dataset citation with access details.
Dataset Splits | No | The paper does not provide specific train/validation/test dataset splits, percentages, or sample counts. It mentions 'averaged over 20 runs' for evaluation, but not data partitioning.
Hardware Specification | No | No specific hardware details (e.g., GPU models, CPU types, or memory) used for running the experiments were provided in the paper.
Software Dependencies | No | The paper mentions algorithms like 'gradient-descent Sarsa(λ)' and 'eNAC' but does not provide specific software or library names with version numbers (e.g., Python 3.x, PyTorch 1.x) that are required to reproduce the experiments.
Experiment Setup | Yes | At each step we perform one eNAC update based on 50 episodes and then refit Qω using 50 gradient-descent Sarsa(λ) episodes.
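
The paper reports Algorithm 1 (Q-PAMDP(k)) only as pseudocode and releases no code. Below is a minimal Python sketch of the alternation it describes, assuming the schedule quoted in the Experiment Setup row: k policy-search (eNAC) updates of the continuous-parameter policy, followed by a Sarsa(λ) refit of the action-value weights. The names q_pamdp, enac_update, sarsa_refit, and n_outer_iters are illustrative placeholders, not the authors' implementation; the actual eNAC and Sarsa(λ) procedures would be supplied as the two callables.

```python
from typing import Any, Callable, Tuple

def q_pamdp(
    k: int,
    theta: Any,                               # parameter-selection policy weights
    omega: Any,                               # action-value (Q) weights
    enac_update: Callable[[Any, Any], Any],   # one eNAC step (e.g. over 50 episodes)
    sarsa_refit: Callable[[Any, Any], Any],   # Sarsa(lambda) refit of Q_omega
    n_outer_iters: int = 100,
) -> Tuple[Any, Any]:
    """Sketch of the Q-PAMDP(k) alternation: k parameter-policy updates,
    then a refit of the discrete-action value function, repeated."""
    for _ in range(n_outer_iters):
        # k policy-search updates of the continuous action parameters,
        # holding the action-value weights fixed.
        for _ in range(k):
            theta = enac_update(theta, omega)
        # Refit Q_omega with gradient-descent Sarsa(lambda),
        # holding the parameter policy fixed.
        omega = sarsa_refit(theta, omega)
    return theta, omega
```

With k = 1, and the two callables performing one eNAC update over 50 episodes and a 50-episode Sarsa(λ) refit respectively, a single outer iteration would match the per-step schedule quoted in the Experiment Setup row.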