Adaptive Reward-Poisoning Attacks against Reinforcement Learning
Authors: Xuezhou Zhang, Yuzhe Ma, Adish Singla, Xiaojin Zhu
ICML 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, we formulate the reward poisoning problem as an optimal control problem on a higher-level attack MDP, and develop computational tools based on DRL that are able to find efficient attack policies across a variety of environments. ... In this section, we make empirical comparisons between a number of attack policies φ: ... In all of our experiments, we assume a standard Q-learning RL agent with parameters: Q_0 = 0_{S×A}, ε = 0.1, γ = 0.9, α_t = 0.9 ∀t. The plots show ±1 standard error around each curve (some are difficult to see). |
| Researcher Affiliation | Academia | ¹University of Wisconsin-Madison, ²Max Planck Institute for Software Systems (MPI-SWS). |
| Pseudocode | Yes | Algorithm 1 Reward Poisoning against Q-learning; Algorithm 2 The Non-Adaptive Attack φ³_sas; Algorithm 3 The Fast Adaptive Attack (FAA). |
| Open Source Code | No | The paper does not provide an explicit statement or link indicating that the source code for the described methodology is publicly available. |
| Open Datasets | No | The paper uses standard environments like a "chain MDP" and a "Grid World". While these environments are well-defined, the paper does not refer to them as publicly available datasets with specific access information (link, DOI, or formal citation to a dataset repository). |
| Dataset Splits | No | The paper describes experiments within simulated environments (MDPs) for reinforcement learning. It does not mention explicit training/validation/test dataset splits as would typically apply to static datasets in supervised learning. |
| Hardware Specification | No | The paper does not provide any specific details about the hardware used to run the experiments (e.g., CPU/GPU models, memory, or cloud instance types). |
| Software Dependencies | No | The paper mentions "Twin Delayed DDPG (TD3)" but does not specify its version number or any other software dependencies with version information. |
| Experiment Setup | Yes | In all of our experiments, we assume a standard Q-learning RL agent with parameters: Q_0 = 0_{S×A}, ε = 0.1, γ = 0.9, α_t = 0.9 ∀t. (A minimal code sketch of this setup follows the table.) |
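
The Pseudocode and Experiment Setup rows reference Algorithm 1 (reward poisoning against Q-learning) and the reported agent parameters Q_0 = 0_{S×A}, ε = 0.1, γ = 0.9, α_t = 0.9 ∀t. The Python sketch below illustrates that setup under stated assumptions: the `ChainMDP` environment, the `attacker` callable interface, and all class and function names are hypothetical stand-ins chosen for illustration, not the authors' implementation (the paper does not link source code).

```python
import numpy as np


class ChainMDP:
    """Toy chain environment (hypothetical stand-in for the paper's chain MDP):
    states 0..n-1, actions {0: left, 1: right}, reward 1 at the rightmost state."""
    def __init__(self, n_states=5):
        self.n_states, self.n_actions = n_states, 2
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        if action == 0:
            self.state = max(0, self.state - 1)
        else:
            self.state = min(self.n_states - 1, self.state + 1)
        reward = 1.0 if self.state == self.n_states - 1 else 0.0
        return self.state, reward, False  # non-episodic chain


class QLearningAgent:
    """Epsilon-greedy Q-learning with the parameters reported in the paper:
    Q_0 = 0_{S x A}, epsilon = 0.1, gamma = 0.9, alpha_t = 0.9 for all t."""
    def __init__(self, n_states, n_actions, epsilon=0.1, gamma=0.9, alpha=0.9):
        self.Q = np.zeros((n_states, n_actions))
        self.epsilon, self.gamma, self.alpha = epsilon, gamma, alpha

    def act(self, state, rng):
        if rng.random() < self.epsilon:
            return int(rng.integers(self.Q.shape[1]))   # explore
        return int(np.argmax(self.Q[state]))            # exploit

    def update(self, s, a, r, s_next):
        td_target = r + self.gamma * np.max(self.Q[s_next])
        self.Q[s, a] += self.alpha * (td_target - self.Q[s, a])


def run_poisoned(env, agent, attacker, steps=1000, seed=0):
    """Reward-poisoning loop in the spirit of Algorithm 1: at each step the
    attacker observes (Q, s, a, s', r) and adds a perturbation delta to the
    reward before the agent performs its Q-update."""
    rng = np.random.default_rng(seed)
    s = env.reset()
    for _ in range(steps):
        a = agent.act(s, rng)
        s_next, r, done = env.step(a)
        delta = attacker(agent.Q, s, a, s_next, r)      # attack policy phi
        agent.update(s, a, r + delta, s_next)           # poisoned reward
        s = s_next if not done else env.reset()
    return agent.Q


if __name__ == "__main__":
    env = ChainMDP()
    agent = QLearningAgent(env.n_states, env.n_actions)
    # Example non-adaptive attacker: penalize the "right" action in every state.
    attacker = lambda Q, s, a, s_next, r: -2.0 if a == 1 else 0.0
    Q = run_poisoned(env, agent, attacker, steps=2000)
    print("Learned greedy policy:", np.argmax(Q, axis=1))
```

As a design note, the paper distinguishes non-adaptive attack policies, which depend only on the observed transition (s, a, s'), from adaptive policies that may also condition on the agent's current Q-table; the lambda above corresponds to the non-adaptive case, while an adaptive attacker would additionally inspect the `Q` argument passed to it.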