Expressing Arbitrary Reward Functions as Potential-Based Advice

Authors: Anna Harutyunyan, Sam Devlin, Peter Vrancx, Ann Nowé

AAAI 2015 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We show that advice provided in this way captures the input reward function in expectation, and demonstrate its efficacy empirically." and, from the Experiments section, "We first demonstrate our method correctly solving a gridworld task, as a simplified instance of the bicycle problem. We then assess the practical utility of our framework on a larger cart-pole benchmark, and show that our dynamic (PB) value-function advice outperforms other reward-shaping methods that encode the same knowledge, as well as a popular static shaping w.r.t. a different heuristic."
Researcher Affiliation | Academia | Anna Harutyunyan, Vrije Universiteit Brussel (aharutyu@vub.ac.be); Sam Devlin, University of York (sam.devlin@york.ac.uk); Peter Vrancx, Vrije Universiteit Brussel (pvrancx@vub.ac.be); Ann Nowé, Vrije Universiteit Brussel (anowe@vub.ac.be)
Pseudocode | No | The paper describes its algorithms using mathematical equations and textual descriptions, but it does not include a formal pseudocode block or algorithm listing.
Open Source Code | No | The paper does not contain any explicit statements or links indicating that the source code for the described methodology is publicly available.
Open Datasets | Yes | "We now evaluate our approach on a more difficult cart-pole benchmark (Michie and Chambers 1968)."
Dataset Splits | No | The paper describes the learning process and parameter tuning, but it does not specify explicit training, validation, and test splits with percentages or sample counts.
Hardware Specification | No | The paper does not provide any specific details regarding the hardware used to run the experiments.
Software Dependencies | No | The paper describes the algorithms and learning methods used (e.g., Sarsa, tile coding) but does not list any specific software dependencies or their version numbers.
Experiment Setup | Yes | "The learning parameters were tuned to the following values: γ = 0.99, c = 1, α_{t+1} = τ·α_t decaying exponentially (so as to satisfy the condition in Eq. (21)), with α_0 = 0.05, τ = 0.999 and β_t = 0.1." and "The learning parameters were tuned to the following: λ = 0.9, c = 0.1, α_{t+1} = τ·α_t decaying exponentially (so as to satisfy the condition in Eq. (21)), with α_0 = 0.05, τ = 0.999, and β_t = 0.2. We found γ to affect the results differently across variants, with the following best values: γ_1 = 0.8, γ_2 = γ_3 = γ_4 = 0.99, γ_5 = 0.4."
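
The step-size schedule quoted in the last row, α_{t+1} = τ·α_t with α_0 = 0.05 and τ = 0.999, is a plain exponential decay. The short Python sketch below only collects the hyperparameters quoted above and generates that schedule; the dictionary and function names are illustrative rather than taken from the authors' code, and β_t is recorded only as a second, constant step size as stated in the quoted setup.

    # Illustrative sketch, not the authors' implementation: the numeric values
    # are copied from the quoted experiment setup; all names are hypothetical.
    gridworld_setup = {
        "gamma": 0.99,   # discount factor
        "c": 1.0,        # constant c from the quoted setup
        "alpha0": 0.05,  # initial step size
        "tau": 0.999,    # exponential decay rate of the step size
        "beta": 0.1,     # second, constant step size reported in the setup
    }
    cartpole_setup = {
        "lambda": 0.9,
        "c": 0.1,
        "alpha0": 0.05,
        "tau": 0.999,
        "beta": 0.2,
        # Best discount factor differed per shaping variant (gamma_1 .. gamma_5).
        "gamma_per_variant": [0.8, 0.99, 0.99, 0.99, 0.4],
    }

    def alpha_schedule(alpha0, tau, num_steps):
        """Exponentially decayed step sizes: alpha_t = alpha0 * tau**t."""
        return [alpha0 * tau ** t for t in range(num_steps)]

    if __name__ == "__main__":
        print(alpha_schedule(cartpole_setup["alpha0"], cartpole_setup["tau"], 4))
        # -> [0.05, 0.04995, 0.04990005, ...]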