Reward Prediction Error as an Exploration Objective in Deep RL

Authors: Riley Simmons-Edler, Ben Eisner, Daniel Yang, Anthony Bisulco, Eric Mitchell, Sebastian Seung, Daniel Lee

IJCAI 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate the exploration behavior of QXplore on several OpenAI Gym MuJoCo tasks and Atari games and observe that QXplore is comparable to or better than a baseline state-novelty method in all cases, outperforming the baseline on tasks where state novelty is not well-correlated with improved reward. ... We describe here the results of experiments to demonstrate the effectiveness of QXplore on continuous control and Atari benchmark tasks.
Researcher Affiliation | Collaboration | Princeton University; Samsung AI Center NYC; Stanford University
Pseudocode | Yes | Our full method is described for the continuous-action domain in Algorithm 1 and a schematic of the method is shown in Figure 1. (A hedged sketch of the TD-error exploration signal appears after this table.)
Open Source Code | No | The paper does not include any explicit statement about providing open-source code for the described methodology, nor does it provide a direct link to a code repository.
Open Datasets | Yes | We benchmark on five continuous control tasks using the MuJoCo physics simulator... FetchPush, FetchSlide and FetchPickAndPlace, originally proposed in HER [Andrychowicz et al., 2017]... we also evaluated QXplore on several games in the Atari Arcade Learning Environment [Bellemare et al., 2013] to verify that QXplore extends to tasks with image observations and discrete action spaces. (Constructing these benchmark environments is sketched after this table.)
Dataset Splits | No | The paper states: 'For all experiments, we set the data sampling ratios of Qθ and Qx, RQ and RQx respectively, at 0.75...' This refers to sampling ratios during training, not to explicit training/validation/test dataset splits. Explicit splits in the conventional supervised-learning sense are not typically defined for RL environments, and the paper does not describe them. (One reading of this sampling ratio is sketched after this table.)
Hardware Specification | No | The paper does not provide specific details about the hardware used for running the experiments, such as GPU models, CPU types, or memory specifications. It refers only generally to 'deep RL settings'.
Software Dependencies | No | The paper mentions software components such as 'TD3/dueling double deep Q-networks' and the 'Dopamine implementation of DQN', but it does not specify version numbers for these dependencies, which are necessary for full reproducibility.
Experiment Setup | Yes | For all experiments, we set the data sampling ratios of Qθ and Qx, RQ and RQx respectively, at 0.75, the best ratio among a sweep of 0.0, 0.25, 0.5, and 0.75 on SparseHalfCheetah. For continuous control tasks, we used a learning rate of 0.0001 for both Q-functions, the best among all paired combinations of 0.01, 0.001, and 0.0001, and fully connected networks with two hidden layers of 256 neurons to represent each Q-function, with no shared parameters. For Atari benchmark tasks, we used the dueling double deep Q-network architecture and hyperparameters described by Wang et al. (A matching network and optimizer setup is sketched after this table.)
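
The Pseudocode row points to Algorithm 1, which is not reproduced on this page. Below is a minimal PyTorch sketch of the signal the paper's title describes: the TD error of the task Q-function Qθ used as the reward for a second, exploration-driving Q-function Qx. The function name, the use of the absolute TD error, the target-network handling, and the batch layout are illustrative assumptions; Algorithm 1 in the paper is the authoritative procedure.

```python
import torch
import torch.nn as nn


def td_error_reward(q_theta: nn.Module,
                    q_theta_target: nn.Module,
                    batch: dict,
                    gamma: float = 0.99) -> torch.Tensor:
    """Absolute TD error of the task Q-function Q_theta, used as the
    exploration reward for training Q_x (assumed formulation)."""
    with torch.no_grad():
        # One-step TD target for Q_theta on the sampled transitions.
        next_q = q_theta_target(batch["next_obs"], batch["next_action"])
        td_target = batch["reward"] + gamma * (1.0 - batch["done"]) * next_q
        current_q = q_theta(batch["obs"], batch["action"])
        # Reward prediction error as the intrinsic exploration signal.
        return (td_target - current_q).abs()
```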
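The benchmark families listed under Open Datasets are standard Gym environments. The sketch below shows how such environments are typically constructed; the exact environment IDs and version suffixes depend on the installed Gym/ALE release, and the paper's sparse-reward MuJoCo variants are not reproduced here.

```python
import gym

# Typical constructors for the benchmark families named in the table.
mujoco_env = gym.make("HalfCheetah-v2")          # MuJoCo continuous control
fetch_env = gym.make("FetchPush-v1")             # robotics task from HER
atari_env = gym.make("BreakoutNoFrameskip-v4")   # Atari ALE, image observations
```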
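The 0.75 sampling ratio quoted under Dataset Splits and Experiment Setup concerns how minibatches are drawn from the two replay buffers RQ and RQx. The sketch below assumes the ratio is the fraction of each minibatch drawn from the learner's own buffer, with the remainder drawn from the other Q-function's buffer; the paper defines the exact meaning.

```python
import random


def sample_mixed_batch(own_buffer, other_buffer, batch_size=256, ratio=0.75):
    """Draw `ratio` of a minibatch from the learner's own replay buffer and
    the rest from the other Q-function's buffer (assumed reading of the
    0.75 sampling ratio reported in the paper)."""
    n_own = min(int(round(batch_size * ratio)), len(own_buffer))
    n_other = min(batch_size - n_own, len(other_buffer))
    batch = random.sample(own_buffer, n_own) + random.sample(other_buffer, n_other)
    random.shuffle(batch)
    return batch
```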
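The continuous-control setup quoted under Experiment Setup (two hidden layers of 256 units per Q-function, no shared parameters, learning rate 0.0001) maps directly onto a small PyTorch module. The ReLU activations, state-action concatenation, Adam optimizer, and example dimensions are assumptions not stated in the quote.

```python
import torch
import torch.nn as nn


class QNetwork(nn.Module):
    """Fully connected Q-function with two hidden layers of 256 units, as
    reported for the continuous-control experiments."""

    def __init__(self, obs_dim: int, act_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, 1),
        )

    def forward(self, obs: torch.Tensor, act: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([obs, act], dim=-1))


# Two independent Q-functions (no shared parameters) trained with the
# reported learning rate of 0.0001; Adam and the HalfCheetah dimensions
# (17 observation, 6 action) are illustrative assumptions.
q_theta, q_x = QNetwork(17, 6), QNetwork(17, 6)
opt_theta = torch.optim.Adam(q_theta.parameters(), lr=1e-4)
opt_x = torch.optim.Adam(q_x.parameters(), lr=1e-4)
```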