Reinforcement Learning with Perturbed Rewards

Authors: Jingkang Wang, Yang Liu, Bo Li

AAAI 2020, pp. 6202-6209 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments on different DRL platforms show that trained policies based on our estimated surrogate reward can achieve higher expected rewards and converge faster than existing baselines. For instance, the state-of-the-art PPO algorithm is able to obtain 84.6% and 80.8% improvements on average score for five Atari games, with error rates of 10% and 30%, respectively.
Researcher Affiliation | Academia | Jingkang Wang, University of Toronto & Vector Institute, Toronto, Canada (wangjk@cs.toronto.edu); Yang Liu, University of California, Santa Cruz, California, USA (yangliu@ucsc.edu); Bo Li, University of Illinois, Urbana-Champaign, Illinois, USA (lbo@illinois.edu)
Pseudocode | Yes | Algorithm 1: Reward Robust RL (sketch). A hedged illustration of the surrogate-reward step this algorithm relies on is given after the table.
Open Source Code | No | The paper provides a link to the arXiv version of the paper itself (https://arxiv.org/abs/1810.01032), but does not explicitly state or link to any open-source code for the methodology described.
Open Datasets | Yes | Extensive experiments on OpenAI Gym (Brockman et al. 2016) show that the proposed reward-robust RL method achieves comparable performance with the policy trained using the true rewards.
Dataset Splits | No | The paper describes the environments used (CartPole, Pendulum, Atari games) but does not specify training, validation, or test dataset splits in percentages or counts for reproducibility. RL environments are typically interactive, not static datasets with traditional splits.
Hardware Specification | No | The paper does not provide specific details about the hardware used to run its experiments (e.g., GPU/CPU models, memory, or cloud instance types).
Software Dependencies | No | The paper mentions using specific RL algorithms (e.g., Q-Learning, CEM, SARSA, DQN, DDQN, DDPG, NAF, PPO) but does not provide specific version numbers for any software dependencies or libraries.
Experiment Setup | Yes | For simplicity, we firstly discretized (-17, 0] into 17 intervals: (-17, -16], (-16, -15], ..., (-1, 0], with its value approximated using its maximum point. [...] Romoff et al., we adopted sample mean as a simple approximator during the training and set sequence length as 100. A hedged sketch of this discretization and sample-mean step also follows the table.
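
The Algorithm 1 sketch hinges on replacing each observed (perturbed) reward with a surrogate reward defined through the confusion matrix of the reward noise. Below is a minimal Python sketch of that construction, assuming a confusion matrix C with C[i, j] = P(observe level j | true level i) has already been estimated; the function name and the binary example values (error rates 0.1 and 0.3, matching the rates quoted above) are illustrative, not the authors' code.

import numpy as np

def surrogate_rewards(reward_levels, confusion_matrix):
    # Solve C @ r_hat = r so the surrogate reward is unbiased:
    # E[r_hat | true level i] = sum_j C[i, j] * r_hat[j] = reward_levels[i].
    C = np.asarray(confusion_matrix, dtype=float)
    r = np.asarray(reward_levels, dtype=float)
    return np.linalg.solve(C, r)

# Binary example: true reward levels {-1, +1} flipped with rates e_minus, e_plus.
e_minus, e_plus = 0.1, 0.3
C = np.array([[1 - e_minus, e_minus],
              [e_plus,      1 - e_plus]])
r_hat = surrogate_rewards([-1.0, 1.0], C)
# The agent then trains on r_hat[j] whenever it observes reward level j.

For these rates the solve returns roughly [-1.33, 2.0]; plugging back in, 0.9 * (-1.33) + 0.1 * 2.0 is approximately -1, which is the unbiasedness property the surrogate reward is built to satisfy.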
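
For the setup quoted in the Experiment Setup row (continuous rewards in (-17, 0] binned into 17 unit intervals, each represented by its maximum point, with a sample mean over a sequence of 100 observations used as a simple approximator), a hedged sketch is shown below. The helper names and the rolling-buffer estimator are assumptions made for illustration, not the authors' implementation.

import math
from collections import deque

def discretize_reward(r, low=-17.0, high=0.0):
    # Map a continuous reward in (low, high] to the maximum point of its
    # unit interval, e.g. r = -3.4 lies in (-4, -3] and is mapped to -3.
    r = min(max(r, low + 1e-8), high)   # clip into (low, high]
    return float(math.ceil(r))

class SampleMeanEstimator:
    # Rolling sample mean over the most recent seq_len observed rewards,
    # used here as a stand-in for the "simple approximator" with sequence
    # length 100 mentioned in the paper.
    def __init__(self, seq_len=100):
        self.buffer = deque(maxlen=seq_len)

    def update(self, observed_reward):
        self.buffer.append(observed_reward)
        return sum(self.buffer) / len(self.buffer)

# Example: a perturbed Pendulum-style reward of -3.4 is binned to -3 and
# folded into the running mean.
estimator = SampleMeanEstimator(seq_len=100)
binned = discretize_reward(-3.4)
running_mean = estimator.update(binned)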