Reinforcement Learning with Perturbed Rewards

Authors: Jingkang Wang, Yang Liu, Bo Li

AAAI 2020, pp. 6202-6209 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments on different DRL platforms show that trained policies based on our estimated surrogate reward can achieve higher expected rewards and converge faster than existing baselines. For instance, the state-of-the-art PPO algorithm is able to obtain 84.6% and 80.8% improvements on average score for five Atari games, with error rates of 10% and 30%, respectively.
Researcher Affiliation | Academia | Jingkang Wang, University of Toronto & Vector Institute, Toronto, Canada (wangjk@cs.toronto.edu); Yang Liu, University of California, Santa Cruz, California, USA (yangliu@ucsc.edu); Bo Li, University of Illinois, Urbana-Champaign, Illinois, USA (lbo@illinois.edu)
Pseudocode | Yes | Algorithm 1: Reward Robust RL (sketch). A hedged illustration of the surrogate-reward step this algorithm relies on is given after the table.
Open Source Code | No | The paper provides a link to the arXiv version of the paper itself (https://arxiv.org/abs/1810.01032), but does not explicitly state or link to any open-source code for the methodology described.
Open Datasets | Yes | Extensive experiments on OpenAI Gym (Brockman et al. 2016) show that the proposed reward-robust RL method achieves comparable performance with the policy trained using the true rewards.
Dataset Splits | No | The paper describes the environments used (CartPole, Pendulum, Atari games) but does not specify training, validation, or test dataset splits in percentages or counts for reproducibility. RL environments are typically interactive, not static datasets with traditional splits.
Hardware Specification | No | The paper does not provide specific details about the hardware used to run its experiments (e.g., GPU/CPU models, memory, or cloud instance types).
Software Dependencies | No | The paper mentions using specific RL algorithms (e.g., Q-Learning, CEM, SARSA, DQN, DDQN, DDPG, NAF, PPO) but does not provide specific version numbers for any software dependencies or libraries.
Experiment Setup | Yes | For simplicity, we firstly discretized (-17, 0] into 17 intervals: (-17, -16], (-16, -15], ..., (-1, 0], with its value approximated using its maximum point. [...] Romoff et al., we adopted sample mean as a simple approximator during the training and set sequence length as 100. A hedged sketch of this discretization and sample-mean step also follows the table.
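
The Algorithm 1 sketch hinges on replacing each observed (perturbed) reward with a surrogate reward defined through the confusion matrix of the reward noise. Below is a minimal Python sketch of that construction, assuming a confusion matrix C with C[i, j] = P(observe level j | true level i) has already been estimated; the function name and the binary example values (error rates 0.1 and 0.3, matching the rates quoted above) are illustrative, not the authors' code.

import numpy as np

def surrogate_rewards(reward_levels, confusion_matrix):
    # Solve C @ r_hat = r so the surrogate reward is unbiased:
    # E[r_hat | true level i] = sum_j C[i, j] * r_hat[j] = reward_levels[i].
    C = np.asarray(confusion_matrix, dtype=float)
    r = np.asarray(reward_levels, dtype=float)
    return np.linalg.solve(C, r)

# Binary example: true reward levels {-1, +1} flipped with rates e_minus, e_plus.
e_minus, e_plus = 0.1, 0.3
C = np.array([[1 - e_minus, e_minus],
              [e_plus,      1 - e_plus]])
r_hat = surrogate_rewards([-1.0, 1.0], C)
# The agent then trains on r_hat[j] whenever it observes reward level j.

For these rates the solve returns roughly [-1.33, 2.0]; plugging back in, 0.9 * (-1.33) + 0.1 * 2.0 is approximately -1, which is the unbiasedness property the surrogate reward is built to satisfy.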
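
For the setup quoted in the Experiment Setup row (continuous rewards in (-17, 0] binned into 17 unit intervals, each represented by its maximum point, with a sample mean over a sequence of 100 observations used as a simple approximator), a hedged sketch is shown below. The helper names and the rolling-buffer estimator are assumptions made for illustration, not the authors' implementation.

import math
from collections import deque

def discretize_reward(r, low=-17.0, high=0.0):
    # Map a continuous reward in (low, high] to the maximum point of its
    # unit interval, e.g. r = -3.4 lies in (-4, -3] and is mapped to -3.
    r = min(max(r, low + 1e-8), high)   # clip into (low, high]
    return float(math.ceil(r))

class SampleMeanEstimator:
    # Rolling sample mean over the most recent seq_len observed rewards,
    # used here as a stand-in for the "simple approximator" with sequence
    # length 100 mentioned in the paper.
    def __init__(self, seq_len=100):
        self.buffer = deque(maxlen=seq_len)

    def update(self, observed_reward):
        self.buffer.append(observed_reward)
        return sum(self.buffer) / len(self.buffer)

# Example: a perturbed Pendulum-style reward of -3.4 is binned to -3 and
# folded into the running mean.
estimator = SampleMeanEstimator(seq_len=100)
binned = discretize_reward(-3.4)
running_mean = estimator.update(binned)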