Reinforcement Learning from Demonstration through Shaping

Authors: Tim Brys, Anna Harutyunyan, Halit Bener Suay, Sonia Chernova, Matthew E. Taylor, Ann Nowé

IJCAI 2015

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To demonstrate the potential usefulness of the approach proposed in this work, we perform experiments in two domains: Cart Pole and Mario. We compare several approaches, to cover the whole spectrum between RL and LfD: RL (Q(λ)-learning); RLfD (Q(λ)-learning + shaping); RLfD (Q(λ)-learning + HAT); LfD (C4.5 decision tree classifier [Quinlan, 1993]).
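The two RLfD variants inject the demonstrations into Q(λ)-learning, either by shaping the reward with a demonstration-based potential or through HAT. Below is a minimal sketch of the shaping route, assuming an isotropic Gaussian similarity between the current state and the demonstrated states for each action as the potential, and the look-ahead potential-based shaping term F = γΦ(s', a') - Φ(s, a); the function names are illustrative and the Q(λ) learner itself is omitted.

```python
import numpy as np

def make_demo_potential(demonstrations, sigma=0.2):
    """Build a potential Phi(s, a) from demonstrated (state, action) pairs.

    Phi(s, a) is the Gaussian similarity between s and the closest demonstrated
    state in which the demonstrator chose action a (0 if a was never demonstrated).
    """
    by_action = {}
    for state, action in demonstrations:
        by_action.setdefault(action, []).append(np.asarray(state, dtype=float))

    def phi(state, action):
        demo_states = by_action.get(action)
        if not demo_states:
            return 0.0
        s = np.asarray(state, dtype=float)
        sq_dists = [float(np.sum((s - d) ** 2)) for d in demo_states]
        return float(np.exp(-min(sq_dists) / (2.0 * sigma ** 2)))

    return phi

def shaped_reward(r, s, a, s_next, a_next, phi, gamma=1.0):
    """Potential-based shaping: add F = gamma * Phi(s', a') - Phi(s, a) to the reward."""
    return r + gamma * phi(s_next, a_next) - phi(s, a)

# Illustrative usage with two 4-dimensional Cart Pole-like states.
demos = [((0.0, 0.0, 0.05, 0.0), 1), ((0.0, 0.0, -0.05, 0.0), 0)]
phi = make_demo_potential(demos, sigma=0.2)
r_shaped = shaped_reward(0.0, (0.0, 0.0, 0.04, 0.0), 1, (0.0, 0.0, 0.03, 0.0), 1, phi)
```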
Researcher Affiliation | Academia | Tim Brys and Anna Harutyunyan, Vrije Universiteit Brussel ({timbrys, aharutyu}@vub.ac.be); Halit Bener Suay and Sonia Chernova, Worcester Polytechnic Institute ({benersuay, soniac}@wpi.edu); Matthew E. Taylor, Washington State University (taylorm@eecs.wsu.edu); Ann Nowé, Vrije Universiteit Brussel (anowe@vub.ac.be)
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide concrete access to source code (specific repository link, explicit code release statement, or code in supplementary materials) for the methodology described in this paper.
Open Datasets | Yes | To demonstrate the potential usefulness of the approach proposed in this work, we perform experiments in two domains: Cart Pole and Mario. ... Cart Pole [Michie and Chambers, 1968] is a task in which the agent controls a cart with a pole on top. ... The Mario benchmark problem [Karakovskiy and Togelius, 2012] is a public reimplementation of the original Super Mario Bros® game.
Dataset Splits | No | The paper does not provide specific dataset split information (exact percentages, sample counts, citations to predefined splits, or detailed splitting methodology) needed to reproduce the data partitioning. It mentions the number of trials and episodes but not data splits.
Hardware Specification | No | The paper does not provide specific hardware details (exact GPU/CPU models, processor types with speeds, memory amounts, or detailed computer specifications) used for running its experiments.
Software Dependencies | No | The paper does not provide specific ancillary software details (e.g., library or solver names with version numbers) needed to replicate the experiment.
Experiment Setup | Yes | Parameters are α = 0.25/16, γ = 1, ϵ = 0.05, λ = 0.25, with 16 tilings of 10 × 10 × 10 × 10. For the shaping component, σ = 0.2, and we both initialize and shape with the potential function. With HAT, parameters are B = 1, meaning that the Q-values of the actions suggested by the LfD policy are initialized to 1 (and the others to 0), and C = 0, i.e. the LfD policy is not exclusively executed during the initial phases of learning.
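For convenience, the reported settings can be gathered into a single configuration block. The sketch below is only a restatement: the numerical values come from the quoted passage, while the grouping and key names are hypothetical.

```python
# Hyperparameters as quoted above; grouping and key names are illustrative only.
EXPERIMENT_CONFIG = {
    "qlambda": {
        "alpha": 0.25 / 16,          # learning rate, divided by the number of tilings
        "gamma": 1.0,                # discount factor
        "epsilon": 0.05,             # epsilon-greedy exploration rate
        "lambda": 0.25,              # eligibility-trace decay
        "tilings": 16,               # 16 tilings of 10 x 10 x 10 x 10 tiles
        "tiles_per_dim": (10, 10, 10, 10),
    },
    "shaping": {
        "sigma": 0.2,                      # width of the Gaussian similarity potential
        "initialize_with_potential": True, # Q-values initialized from the potential
        "shape_with_potential": True,      # shaping reward added during learning
    },
    "hat": {
        "B": 1,  # Q-values of actions suggested by the LfD policy initialized to 1
        "C": 0,  # the LfD policy is never executed exclusively early in learning
    },
}
```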