Potential Based Reward Shaping for Hierarchical Reinforcement Learning

Authors: Yang Gao, Francesca Toni

IJCAI 2015 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirical results show that PBRS-MAXQ-0 significantly outperforms MAXQ-0 given good heuristics, and can converge even when given misleading heuristics. ... We implement MAXQ-0 and PBRS-MAXQ-0 in two widely used applications for MAXQ: the Fickle Taxi problem and the Resource Collection problem, to compare their performances.
Researcher Affiliation | Academia | Yang Gao, Francesca Toni, Department of Computing, Imperial College London, {y.gao11,f.toni}@imperial.ac.uk
Pseudocode | Yes | Algorithm 1 The PBRS-MAXQ-0 algorithm. (An illustrative shaping sketch follows this table.)
Open Source Code | No | No statement explicitly providing open-source code for the methodology described in this paper was found. Footnote 3 links to supplementary material for proofs, not code.
Open Datasets | No | The paper describes the 'Fickle Taxi problem' and 'Resource Collection problem' as testbeds, which are environments, not specific publicly available datasets with access information (link, DOI, or formal citation). No concrete access information for a public dataset was provided.
Dataset Splits | No | The paper does not provide specific details on train/validation/test dataset splits. It describes learning parameters for episodes in reinforcement learning environments, which is a different concept from dataset splits for supervised learning.
Hardware Specification | No | No specific hardware details (e.g., GPU/CPU models, memory, or detailed computer specifications) used for running the experiments are mentioned in the paper.
Software Dependencies | No | No specific software dependencies with version numbers are provided. The paper does not list the versions of any programming languages, libraries, or frameworks used for implementation.
Experiment Setup | Yes | The initial values and the decreasing rates (in brackets) of α and ϵ are listed in Table 1. ... In all experiments and for all algorithms, we have γ = 1. ... The learning parameters used in each algorithm are listed in Table 2, and they are selected to maximise the convergence speed of each algorithm.
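For context on the technique being evaluated, the sketch below shows potential-based reward shaping (the shaping term F(s, s') = γΦ(s') − Φ(s) added to the environment reward) layered on plain tabular Q-learning. It is a minimal illustration only, not the paper's PBRS-MAXQ-0 (Algorithm 1), which injects the shaping term inside the MAXQ hierarchy; the environment interface (env.reset, env.step, env.actions) and the potential function are assumed for illustration.

```python
import random
from collections import defaultdict

def potential(state):
    """Hypothetical heuristic potential Phi(s).

    In the paper, 'good heuristics' encode domain knowledge (e.g. progress
    towards the goal); returning 0 everywhere makes shaping a no-op.
    """
    return 0.0

def q_learning_with_pbrs(env, episodes=1000, alpha=0.5, epsilon=0.1, gamma=1.0):
    # Q-values for (state, action) pairs, defaulting to 0.
    Q = defaultdict(float)
    for _ in range(episodes):
        state = env.reset()
        done = False
        while not done:
            # Epsilon-greedy action selection over the environment's actions.
            if random.random() < epsilon:
                action = random.choice(env.actions)
            else:
                action = max(env.actions, key=lambda a: Q[(state, a)])
            next_state, reward, done = env.step(action)
            # Potential-based shaping: F(s, s') = gamma * Phi(s') - Phi(s),
            # added to the environment reward (optimal policy is preserved).
            shaped = reward + gamma * potential(next_state) - potential(state)
            best_next = 0.0 if done else max(Q[(next_state, a)] for a in env.actions)
            Q[(state, action)] += alpha * (shaped + gamma * best_next - Q[(state, action)])
            state = next_state
        # The paper decays alpha and epsilon per episode (rates in its tables);
        # the decay schedule is omitted from this sketch.
    return Q
```

The gamma=1.0 default mirrors the γ = 1 setting quoted in the Experiment Setup row; the alpha and epsilon defaults are placeholders, since the paper's values and decay rates are given only in its own tables.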