Expressing Arbitrary Reward Functions as Potential-Based Advice
Authors: Anna Harutyunyan, Sam Devlin, Peter Vrancx, Ann Nowé
AAAI 2015
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | "We show that advice provided in this way captures the input reward function in expectation, and demonstrate its efficacy empirically." From the Experiments section: "We first demonstrate our method correctly solving a gridworld task, as a simplified instance of the bicycle problem. We then assess the practical utility of our framework on a larger cart-pole benchmark, and show that our dynamic (PB) value-function advice outperforms other reward-shaping methods that encode the same knowledge, as well as a popular static shaping w.r.t. a different heuristic." (A hedged sketch of dynamic potential-based advice follows the table.) |
| Researcher Affiliation | Academia | Anna Harutyunyan, Vrije Universiteit Brussel (aharutyu@vub.ac.be); Sam Devlin, University of York (sam.devlin@york.ac.uk); Peter Vrancx, Vrije Universiteit Brussel (pvrancx@vub.ac.be); Ann Nowé, Vrije Universiteit Brussel (anowe@vub.ac.be) |
| Pseudocode | No | The paper describes algorithms using mathematical equations and textual descriptions, but it does not include a formal pseudocode block or algorithm listing. |
| Open Source Code | No | The paper does not contain any explicit statements or links indicating that the source code for the described methodology is publicly available. |
| Open Datasets | Yes | We now evaluate our approach on a more difficult cart-pole benchmark (Michie and Chambers 1968). |
| Dataset Splits | No | The paper describes the learning process and parameter tuning, but it does not specify explicit training, validation, and test dataset splits with percentages or sample counts. |
| Hardware Specification | No | The paper does not provide any specific details regarding the hardware used to run the experiments. |
| Software Dependencies | No | The paper describes the algorithms and learning methods used (e.g., Sarsa, tile coding) but does not list any specific software dependencies or their version numbers. |
| Experiment Setup | Yes | Gridworld experiment: "The learning parameters were tuned to the following values: γ = 0.99, c = 1, αt+1 = ταt decaying exponentially (so as to satisfy the condition in Eq. (21)), with α0 = 0.05, τ = 0.999 and βt = 0.1." Cart-pole experiment: "The learning parameters were tuned to the following: λ = 0.9, c = 0.1, αt+1 = ταt decaying exponentially (so as to satisfy the condition in Eq. (21)), with α0 = 0.05, τ = 0.999, and βt = 0.2. We found γ to affect the results differently across variants, with the following best values: γ1 = 0.8, γ2 = γ3 = γ4 = 0.99, γ5 = 0.4." (A minimal sketch of the decay schedule follows the table.) |
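
The "dynamic (PB) value-function advice" quoted in the Research Type row refers to shaping in which the potential function is itself learned online from an arbitrary advice reward. The snippet below is a minimal, hedged sketch of dynamic potential-based shaping in that general style, not a verified reimplementation of the paper's exact algorithm; the tabular representation, the `expert_advice` callable, and the update ordering are illustrative assumptions.

```python
import collections

# Illustrative sketch (not the authors' code) of dynamic potential-based shaping:
# the shaping reward is F(s, s') = gamma * Phi(s', a') - Phi(s, a), where the
# potential Phi is a secondary value estimate updated online from an arbitrary
# advice reward. Keys are hypothetical (state, action) tuples.

GAMMA = 0.99   # discount factor (matches the gridworld setup quoted above)
BETA = 0.1     # learning rate for the secondary (potential) estimate

q_values = collections.defaultdict(float)    # main task value estimates
potential = collections.defaultdict(float)   # learned potential Phi(s, a)


def shaped_td_target(state, action, reward, next_state, next_action,
                     expert_advice):
    """Return a shaped Sarsa TD target for one transition.

    expert_advice(state, action) is a hypothetical callable providing the
    arbitrary advice reward that is converted into potential-based advice.
    """
    # Dynamic shaping term, evaluated with the current potential estimates.
    shaping = (GAMMA * potential[(next_state, next_action)]
               - potential[(state, action)])

    # Update the potential toward the negated advice reward (sketch of the
    # idea that the shaping captures the advice in expectation; not the
    # paper's exact update rule).
    advice_td = (-expert_advice(state, action)
                 + GAMMA * potential[(next_state, next_action)]
                 - potential[(state, action)])
    potential[(state, action)] += BETA * advice_td

    # Shaped target used by the main Sarsa update.
    return reward + shaping + GAMMA * q_values[(next_state, next_action)]
```

Because the potential changes over time, this is a dynamic shaping scheme; the static shapings mentioned in the same quote would instead use a fixed, hand-designed potential.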
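The Experiment Setup row quotes an exponentially decaying learning rate, αt+1 = ταt with α0 = 0.05 and τ = 0.999. Below is a minimal sketch of that schedule; the function name and the choice of what t indexes (steps vs. episodes) are assumptions, since the quote only gives the recurrence and the constants, and the condition in the paper's Eq. (21) is not reproduced here.

```python
def learning_rate_schedule(t, alpha0=0.05, tau=0.999):
    """Exponentially decaying learning rate: alpha_{t+1} = tau * alpha_t.

    Closed form of the quoted recurrence: alpha_t = alpha0 * tau**t.
    """
    return alpha0 * tau ** t


# Example values under the quoted constants.
print(learning_rate_schedule(0))     # 0.05
print(learning_rate_schedule(1000))  # ~0.0184 (0.05 * 0.999**1000)
```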