Pitfalls of Learning a Reward Function Online
Authors: Stuart Armstrong, Jan Leike, Laurent Orseau, Shane Legg
IJCAI 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Section 6, "Experiments": "Here we will experimentally contrast a riggable agent, an influenceable (but unriggable) agent, and an uninfluenceable agent. This will illustrate pathological behaviours of influenceable/riggable agents: learning the wrong thing, choosing to learn when they already know, and just refusing to learn." |
| Researcher Affiliation | Collaboration | Stuart Armstrong (1, 2), Jan Leike (3), Laurent Orseau (3), and Shane Legg (3). Affiliations: 1. Future of Humanity Institute, Oxford University, UK; 2. Machine Intelligence Research Institute, Berkeley, USA; 3. DeepMind, London, UK |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide any concrete access to source code, such as a specific repository link, an explicit code release statement, or code in supplementary materials, for the methodology described in this paper. |
| Open Datasets | No | The paper describes a custom '4 x 3 gridworld' environment for its experiments but does not provide concrete access information (link, DOI, repository name, formal citation) for a publicly available or open dataset. |
| Dataset Splits | No | The paper describes the Q-learning setup and number of runs but does not provide specific dataset split information (exact percentages, sample counts, citations to predefined splits, or detailed splitting methodology) needed to reproduce data partitioning. |
| Hardware Specification | No | The paper does not provide specific hardware details (exact GPU/CPU models, processor types, or memory amounts) used for running its experiments. |
| Software Dependencies | No | The paper mentions using 'Q-learning' but does not provide specific ancillary software details, such as library or solver names with version numbers, needed to replicate the experiment. |
| Experiment Setup | Yes | The exploration rate is ϵ = 0.1, and the agent is run for at most ten steps each episode. The learning rate is 1/n, where n is not the number of episodes the agent has taken but the number of times the agent has been in state s and taken action a, i.e., the number of times it has updated Q(s, a); each Q-value therefore has its own n. For each setup, Q-learning is run 1000 times. |
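
The Experiment Setup row above reports enough hyperparameters to sketch the tabular Q-learning loop, though not the environment or the reward-learning process itself. The following is a minimal, hedged sketch assuming a generic gridworld interface (`env.reset`, `env.step`, an `actions` list); the discount factor, episode count, and environment API are assumptions for illustration, not details from the paper. Its only purpose is to show the reported settings: ϵ-greedy exploration with ϵ = 0.1, episodes capped at ten steps, a per-(s, a) learning rate of 1/n where n counts updates to Q(s, a), and 1000 independent runs per setup.

```python
import random
from collections import defaultdict

EPSILON = 0.1        # exploration rate reported in the paper
MAX_STEPS = 10       # at most ten steps per episode (reported)
NUM_RUNS = 1000      # Q-learning repeated 1000 times per setup (reported)
GAMMA = 1.0          # discount factor: not stated in the excerpt, assumed
NUM_EPISODES = 500   # episodes per run: not stated in the excerpt, assumed


def q_learning_run(env, actions, num_episodes=NUM_EPISODES):
    """One Q-learning run with a per-(s, a) learning rate of 1/n."""
    Q = defaultdict(float)  # Q[(state, action)] -> value estimate
    n = defaultdict(int)    # n[(state, action)] -> number of updates so far

    for _ in range(num_episodes):
        state = env.reset()
        for _ in range(MAX_STEPS):
            # epsilon-greedy action selection
            if random.random() < EPSILON:
                action = random.choice(actions)
            else:
                action = max(actions, key=lambda a: Q[(state, a)])

            next_state, reward, done = env.step(action)

            # Learning rate 1/n: n is the number of times this particular
            # (s, a) pair has been updated, so each Q-value has its own n.
            n[(state, action)] += 1
            alpha = 1.0 / n[(state, action)]
            target = reward + GAMMA * max(Q[(next_state, a)] for a in actions)
            Q[(state, action)] += alpha * (target - Q[(state, action)])

            state = next_state
            if done:
                break
    return Q


# For each setup (riggable / influenceable / uninfluenceable), the paper
# runs Q-learning 1000 times; `make_env` is a hypothetical constructor:
# results = [q_learning_run(make_env(setup), ACTIONS) for _ in range(NUM_RUNS)]
```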