Expressing Arbitrary Reward Functions as Potential-Based Advice

Authors: Anna Harutyunyan, Sam Devlin, Peter Vrancx, Ann Nowé

AAAI 2015 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We show that advice provided in this way captures the input reward function in expectation, and demonstrate its efficacy empirically." and, from the Experiments section, "We first demonstrate our method correctly solving a gridworld task, as a simplified instance of the bicycle problem. We then assess the practical utility of our framework on a larger cart-pole benchmark, and show that our dynamic (PB) value-function advice outperforms other reward-shaping methods that encode the same knowledge, as well as a popular static shaping w.r.t. a different heuristic."
Researcher Affiliation | Academia | Anna Harutyunyan, Vrije Universiteit Brussel (aharutyu@vub.ac.be); Sam Devlin, University of York (sam.devlin@york.ac.uk); Peter Vrancx, Vrije Universiteit Brussel (pvrancx@vub.ac.be); Ann Nowé, Vrije Universiteit Brussel (anowe@vub.ac.be)
Pseudocode | No | The paper describes its algorithms using mathematical equations and textual descriptions, but it does not include a formal pseudocode block or algorithm listing.
Open Source Code | No | The paper does not contain any explicit statements or links indicating that the source code for the described methodology is publicly available.
Open Datasets | Yes | "We now evaluate our approach on a more difficult cart-pole benchmark (Michie and Chambers 1968)."
Dataset Splits | No | The paper describes the learning process and parameter tuning, but it does not specify explicit training, validation, and test splits with percentages or sample counts.
Hardware Specification | No | The paper does not provide any specific details regarding the hardware used to run the experiments.
Software Dependencies | No | The paper describes the algorithms and learning methods used (e.g., Sarsa, tile coding) but does not list any specific software dependencies or their version numbers.
Experiment Setup | Yes | "The learning parameters were tuned to the following values: γ = 0.99, c = 1, α_{t+1} = τ·α_t decaying exponentially (so as to satisfy the condition in Eq. (21)), with α_0 = 0.05, τ = 0.999 and β_t = 0.1." and "The learning parameters were tuned to the following: λ = 0.9, c = 0.1, α_{t+1} = τ·α_t decaying exponentially (so as to satisfy the condition in Eq. (21)), with α_0 = 0.05, τ = 0.999, and β_t = 0.2. We found γ to affect the results differently across variants, with the following best values: γ_1 = 0.8, γ_2 = γ_3 = γ_4 = 0.99, γ_5 = 0.4."
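
The step-size schedule quoted in the last row, α_{t+1} = τ·α_t with α_0 = 0.05 and τ = 0.999, is a plain exponential decay. The short Python sketch below only collects the hyperparameters quoted above and generates that schedule; the dictionary and function names are illustrative rather than taken from the authors' code, and β_t is recorded only as a second, constant step size as stated in the quoted setup.

    # Illustrative sketch, not the authors' implementation: the numeric values
    # are copied from the quoted experiment setup; all names are hypothetical.
    gridworld_setup = {
        "gamma": 0.99,   # discount factor
        "c": 1.0,        # constant c from the quoted setup
        "alpha0": 0.05,  # initial step size
        "tau": 0.999,    # exponential decay rate of the step size
        "beta": 0.1,     # second, constant step size reported in the setup
    }
    cartpole_setup = {
        "lambda": 0.9,
        "c": 0.1,
        "alpha0": 0.05,
        "tau": 0.999,
        "beta": 0.2,
        # Best discount factor differed per shaping variant (gamma_1 .. gamma_5).
        "gamma_per_variant": [0.8, 0.99, 0.99, 0.99, 0.4],
    }

    def alpha_schedule(alpha0, tau, num_steps):
        """Exponentially decayed step sizes: alpha_t = alpha0 * tau**t."""
        return [alpha0 * tau ** t for t in range(num_steps)]

    if __name__ == "__main__":
        print(alpha_schedule(cartpole_setup["alpha0"], cartpole_setup["tau"], 4))
        # -> [0.05, 0.04995, 0.04990005, ...]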