Reinforcement Learning with a Corrupted Reward Channel

Authors: Tom Everitt, Victoria Krakovna, Laurent Orseau, Shane Legg

IJCAI 2017 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Finally, the results are illustrated with some simple experiments (Section 6). We illustrate the theoretical results with some simple experiments on a gridworld containing some goal tiles with true reward 0.9 (indicated by yellow circles) and a corrupt reward tile with observed reward 1 and true reward 0 (indicated by a blue square). Average observed and true rewards are shown in Figure 3. (A hedged sketch of this gridworld follows the table.)
Researcher Affiliation | Collaboration | Tom Everitt (Australian National University, tom4everitt@gmail.com); Victoria Krakovna (DeepMind, vkrakovna@google.com); Laurent Orseau (DeepMind, lorseau@google.com); Shane Legg (DeepMind, legg@google.com). A footnote states that Marcus Hutter (ANU) should be recognised as fourth author.
Pseudocode | No | The paper describes conceptual algorithms such as Q-learning, softmax, and quantilising agents, but it does not include any structured pseudocode or algorithm blocks. (A hedged sketch of the quantilising choice rule follows the table.)
Open Source Code | No | The paper does not provide any explicit statement about open-sourcing the code for the methodology described, nor does it include a link to a code repository. It mentions using the AIXIjs framework, but does not mention releasing its own implementation.
Open Datasets | No | The paper describes experiments conducted 'on a gridworld containing some goal tiles'. This appears to be a custom-built environment, and no concrete access information (link, DOI, formal citation, or repository) for a publicly available dataset is provided.
Dataset Splits | No | The paper does not provide specific dataset split information (percentages, sample counts, citations to predefined splits, or detailed splitting methodology) for training, validation, or testing.
Hardware Specification | No | The paper states that the implementation was done in the 'AIXIjs framework for reinforcement learning', but it does not provide any specific hardware details such as GPU models, CPU models, or memory specifications used for running the experiments.
Software Dependencies | No | The paper mentions implementation in the 'AIXIjs framework', but it does not specify any version numbers for this framework or any other software dependencies needed to replicate the experiment.
Experiment Setup | Yes | The discounting factor is γ = 0.9. We run Q-learning with ϵ-greedy (ϵ = 0.1), softmax with temperature β = 2, and the quantilising agent with δ = 0.2, 0.5, 0.8 (where 0.8 = 1 − q/|S| = 1 − 5/25) for 100 runs with 1 million cycles. (A hedged setup sketch using these values follows the table.)
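
For concreteness, the gridworld quoted in the Research Type and Open Datasets rows can be sketched as follows. This is a minimal Python sketch, not the authors' AIXIjs implementation: the grid size, tile coordinates, start state, and action set are assumptions chosen only to match the quoted description of goal tiles with true reward 0.9 and a corrupt tile with observed reward 1 and true reward 0.

```python
class CorruptRewardGridworld:
    """Gridworld with a corrupted reward channel, as described in the
    assessment: goal tiles have true reward 0.9, and a single corrupt
    tile reports an observed reward of 1.0 while its true reward is 0.
    Grid size, tile positions, start state, and the action set are
    illustrative assumptions, not taken from the paper."""

    ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right

    def __init__(self, size=5, goal_tiles=((0, 4), (4, 0), (4, 4)),
                 corrupt_tile=(0, 0), start=(2, 2)):
        self.size = size
        self.goal_tiles = set(goal_tiles)
        self.corrupt_tile = corrupt_tile
        self.state = start

    def step(self, action_idx):
        """Move the agent and return (next_state, observed_reward, true_reward)."""
        dr, dc = self.ACTIONS[action_idx]
        r = min(max(self.state[0] + dr, 0), self.size - 1)
        c = min(max(self.state[1] + dc, 0), self.size - 1)
        self.state = (r, c)
        # The agent only ever sees the observed reward; the true reward is
        # returned alongside it so that true performance can be tracked.
        if self.state == self.corrupt_tile:
            return self.state, 1.0, 0.0
        if self.state in self.goal_tiles:
            return self.state, 0.9, 0.9
        return self.state, 0.0, 0.0
```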
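
The quantilising agent mentioned in the Pseudocode and Experiment Setup rows can be illustrated with a short choice rule. This is a hedged sketch of the general idea (sample from the top δ-fraction by observed reward rather than maximise); the function name `quantilise` and the dictionary-of-rewards interface are inventions of this sketch, not the paper's API.

```python
import random

def quantilise(observed_reward, delta):
    """Pick uniformly at random from the top delta-fraction of states
    ranked by observed reward, rather than the single best-looking state.
    When only a few states are corrupt, most sampled states carry honest
    rewards, which is the intuition behind the paper's quantilising agent.
    The exact object being ranked (states vs. policies) and the tie-breaking
    are simplifying assumptions of this sketch."""
    ranked = sorted(observed_reward, key=observed_reward.get, reverse=True)
    top_k = max(1, int(delta * len(ranked)))
    return random.choice(ranked[:top_k])

# Example: with delta = 0.8 over 25 states, the rule samples uniformly
# from the 20 states with the highest observed reward.
```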
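
The Experiment Setup row lists the quoted hyperparameters (γ = 0.9, ϵ = 0.1, β = 2, δ ∈ {0.2, 0.5, 0.8}, 100 runs of 1 million cycles). A rough Q-learning harness using those values might look like the following; the learning rate, action count, softmax convention, and all function names are assumptions, and the environment is the `CorruptRewardGridworld` sketch above rather than the paper's AIXIjs gridworld.

```python
import math
import random
from collections import defaultdict

# Hyperparameters quoted in the assessment; the learning rate, number of
# actions, and the environment itself are illustrative assumptions.
GAMMA = 0.9                 # discount factor
EPSILON = 0.1               # epsilon-greedy exploration rate
BETA = 2.0                  # softmax temperature
DELTAS = (0.2, 0.5, 0.8)    # quantilising agent settings
RUNS, CYCLES = 100, 1_000_000
ALPHA = 0.1                 # learning rate (not stated in the excerpt)
N_ACTIONS = 4

def epsilon_greedy(q_values):
    """Pick a random action with probability EPSILON, else the greedy one."""
    if random.random() < EPSILON:
        return random.randrange(N_ACTIONS)
    return max(range(N_ACTIONS), key=lambda a: q_values[a])

def softmax(q_values):
    """Sample an action with probability proportional to exp(q / BETA).
    The excerpt calls BETA a temperature; whether the paper divides or
    multiplies by it is not recoverable from the excerpt."""
    prefs = [math.exp(q / BETA) for q in q_values]
    threshold = random.random() * sum(prefs)
    acc = 0.0
    for action, p in enumerate(prefs):
        acc += p
        if acc >= threshold:
            return action
    return N_ACTIONS - 1

def q_learning_run(env, select_action, cycles=CYCLES):
    """One run of tabular Q-learning driven by the observed (possibly
    corrupt) reward, returning the average true reward actually earned."""
    Q = defaultdict(lambda: [0.0] * N_ACTIONS)
    state, total_true = env.state, 0.0
    for _ in range(cycles):
        action = select_action(Q[state])
        next_state, observed, true = env.step(action)
        target = observed + GAMMA * max(Q[next_state])
        Q[state][action] += ALPHA * (target - Q[state][action])
        state = next_state
        total_true += true
    return total_true / cycles
```

A full reproduction would average such runs over the 100 seeds and additionally evaluate the quantilising agent for each δ; that outer loop is omitted here.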