Reinforcement Learning with a Corrupted Reward Channel
Authors: Tom Everitt, Victoria Krakovna, Laurent Orseau, Shane Legg
IJCAI 2017 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Finally, the results are illustrated with some simple experiments (Section 6). We illustrate the theoretical results with some simple experiments on a gridworld containing some goal tiles with true reward 0.9 (indicated by yellow circles) and a corrupt reward tile with observed reward 1 and true reward 0 (indicated by a blue square). Average observed and true rewards are shown in Figure 3. |
| Researcher Affiliation | Collaboration | Tom Everitt, Australian National University (tom4everitt@gmail.com); Victoria Krakovna, DeepMind (vkrakovna@google.com); Laurent Orseau, DeepMind (lorseau@google.com); Shane Legg, DeepMind (legg@google.com). Footnote: Marcus Hutter (ANU) should be recognised as fourth author. |
| Pseudocode | No | The paper describes conceptual algorithms like Q-learning, softmax, and quantilising agents, but it does not include any structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide any explicit statement about open-sourcing the code for the methodology described, nor does it include a link to a code repository. It mentions using the AIXIjs framework, but does not mention releasing the authors' own implementation. |
| Open Datasets | No | The paper describes experiments conducted 'on a gridworld containing some goal tiles'. This appears to be a custom-built environment, and no concrete access information (link, DOI, formal citation, or repository) for a publicly available dataset is provided. |
| Dataset Splits | No | The paper does not provide specific dataset split information (percentages, sample counts, citations to predefined splits, or detailed splitting methodology) for training, validation, or testing. |
| Hardware Specification | No | The paper states that the implementation was done in the 'AIXIjs framework for reinforcement learning', but it does not provide any specific hardware details such as GPU models, CPU models, or memory specifications used for running the experiments. |
| Software Dependencies | No | The paper mentions implementation in the 'AIXIjs framework', but it does not specify any version numbers for this framework or any other software dependencies needed to replicate the experiment. |
| Experiment Setup | Yes | The discount factor is γ = 0.9. We run Q-learning with ϵ-greedy (ϵ = 0.1), softmax with temperature β = 2, and the quantilising agent with δ = 0.2, 0.5, 0.8 (where 0.8 = 1 − q/\|S\| = 1 − 5/25) for 100 runs of 1 million cycles each. Minimal illustrative sketches of this setup, under assumed grid details, follow the table. |
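
Since neither the environment nor the agents are released, the following is a minimal sketch (not the authors' AIXIjs implementation) of the kind of experiment the table describes: tabular Q-learning with ϵ-greedy or softmax exploration on a gridworld with a corrupt reward tile. The 5×5 size is inferred from the 1/25 term in the quoted setup; the tile positions, start state, and learning rate α are assumptions. Only γ = 0.9, ϵ = 0.1, β = 2, and the reward values (goal tiles 0.9; corrupt tile observed 1 / true 0) come from the paper's description.

```python
import numpy as np

SIZE = 5                                       # 5x5 grid, so |S| = 25 as implied by the paper
GOAL_TILES = {(0, 4), (4, 0)}                  # hypothetical goal positions: true reward 0.9
CORRUPT_TILE = (4, 4)                          # hypothetical corrupt tile: observed 1, true 0
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # up, down, left, right

def step(state, action):
    """Deterministic move; walking into a wall leaves the agent in place.
    Returns (next_state, observed_reward, true_reward)."""
    r, c = state
    dr, dc = ACTIONS[action]
    nxt = (min(max(r + dr, 0), SIZE - 1), min(max(c + dc, 0), SIZE - 1))
    if nxt == CORRUPT_TILE:
        return nxt, 1.0, 0.0
    if nxt in GOAL_TILES:
        return nxt, 0.9, 0.9
    return nxt, 0.0, 0.0

def run(n_steps=1_000_000, gamma=0.9, alpha=0.1, epsilon=0.1, beta=2.0,
        softmax=False, seed=0):
    """Tabular Q-learning on the observed (possibly corrupted) reward, with
    epsilon-greedy or softmax (temperature beta) exploration. Returns average
    observed and true reward per step, the quantities plotted in Figure 3."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((SIZE, SIZE, len(ACTIONS)))
    state, obs_total, true_total = (0, 0), 0.0, 0.0
    for _ in range(n_steps):
        q_s = Q[state]
        if softmax:
            p = np.exp(beta * (q_s - q_s.max()))
            a = int(rng.choice(len(ACTIONS), p=p / p.sum()))
        elif rng.random() < epsilon:
            a = int(rng.integers(len(ACTIONS)))
        else:
            a = int(q_s.argmax())
        nxt, r_obs, r_true = step(state, a)
        # The update sees only the observed reward, so the corrupt tile looks optimal.
        Q[state][a] += alpha * (r_obs + gamma * Q[nxt].max() - Q[state][a])
        obs_total += r_obs
        true_total += r_true
        state = nxt
    return obs_total / n_steps, true_total / n_steps

if __name__ == "__main__":
    print("epsilon-greedy:", run(n_steps=100_000))
    print("softmax:       ", run(n_steps=100_000, softmax=True))
```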
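The quantilising agent is only described conceptually in the paper, so the sketch below is one simple reading of quantilisation rather than the paper's exact definition: the agent samples a target tile uniformly from the top δ-fraction of tiles ranked by observed reward and collects that tile's true reward. The tile layout is the same hypothetical one used in the sketch above; only the δ values 0.2, 0.5, 0.8 come from the quoted setup.

```python
import numpy as np

SIZE = 5
GOAL_TILES = {(0, 4), (4, 0)}     # same hypothetical layout as the sketch above
CORRUPT_TILE = (4, 4)

def observed_reward(tile):
    """Observed reward of a tile; the corrupt tile looks better than it is."""
    if tile == CORRUPT_TILE:
        return 1.0                 # observed 1, true 0
    return 0.9 if tile in GOAL_TILES else 0.0

def quantilise(delta, rng):
    """Sample a target tile uniformly from the top delta-fraction of tiles
    ranked by observed reward; return the tile and its true reward."""
    tiles = [(r, c) for r in range(SIZE) for c in range(SIZE)]
    ranked = sorted(tiles, key=observed_reward, reverse=True)
    k = max(1, int(round(delta * len(tiles))))   # e.g. delta = 0.8 -> top 20 of 25 tiles
    tile = ranked[int(rng.integers(k))]
    true_reward = 0.9 if tile in GOAL_TILES else 0.0
    return tile, true_reward

rng = np.random.default_rng(0)
for delta in (0.2, 0.5, 0.8):
    avg_true = np.mean([quantilise(delta, rng)[1] for _ in range(10_000)])
    print(f"delta = {delta}: average true reward ~ {avg_true:.3f}")
```

Under this reading, larger δ dilutes the influence of the single corrupt tile on the agent's choice, which is the intuition behind quantilisation as a defence against a corrupted reward channel.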