Adaptive Reward-Poisoning Attacks against Reinforcement Learning
Authors: Xuezhou Zhang, Yuzhe Ma, Adish Singla, Xiaojin Zhu
ICML 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, we formulate the reward poisoning problem as an optimal control problem on a higher-level attack MDP, and develop computational tools based on DRL that are able to find efficient attack policies across a variety of environments. ... In this section, we make empirical comparisons between a number of attack policies φ: ... In all of our experiments, we assume a standard Q-learning RL agent with parameters: Q_0 = 0_{S×A}, ε = 0.1, γ = 0.9, α_t = 0.9 ∀t. The plots show ±1 standard error around each curve (some are difficult to see). |
| Researcher Affiliation | Academia | ¹University of Wisconsin-Madison, ²Max Planck Institute for Software Systems (MPI-SWS). |
| Pseudocode | Yes | Algorithm 1 Reward Poisoning against Q-learning; Algorithm 2 The Non-Adaptive Attack φ³_sas; Algorithm 3 The Fast Adaptive Attack (FAA). |
| Open Source Code | No | The paper does not provide an explicit statement or link indicating that the source code for the described methodology is publicly available. |
| Open Datasets | No | The paper uses standard environments like a "chain MDP" and a "Grid World". While these environments are well-defined, the paper does not refer to them as publicly available datasets with specific access information (link, DOI, or formal citation to a dataset repository). |
| Dataset Splits | No | The paper describes experiments within simulated environments (MDPs) for reinforcement learning. It does not mention explicit training/validation/test dataset splits as would typically apply to static datasets in supervised learning. |
| Hardware Specification | No | The paper does not provide any specific details about the hardware used to run the experiments (e.g., CPU/GPU models, memory, or cloud instance types). |
| Software Dependencies | No | The paper mentions "Twin Delayed DDPG (TD3)" but does not specify its version number or any other software dependencies with version information. |
| Experiment Setup | Yes | In all of our experiments, we assume a standard Q-learning RL agent with parameters: Q_0 = 0_{S×A}, ε = 0.1, γ = 0.9, α_t = 0.9 ∀t. (A minimal code sketch of this setup follows the table.) |
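
The Pseudocode and Experiment Setup rows reference Algorithm 1 (reward poisoning against Q-learning) and the reported agent parameters Q_0 = 0_{S×A}, ε = 0.1, γ = 0.9, α_t = 0.9 ∀t. The Python sketch below illustrates that setup under stated assumptions: the `ChainMDP` environment, the `attacker` callable interface, and all class and function names are hypothetical stand-ins chosen for illustration, not the authors' implementation (the paper does not link source code).

```python
import numpy as np


class ChainMDP:
    """Toy chain environment (hypothetical stand-in for the paper's chain MDP):
    states 0..n-1, actions {0: left, 1: right}, reward 1 at the rightmost state."""
    def __init__(self, n_states=5):
        self.n_states, self.n_actions = n_states, 2
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        if action == 0:
            self.state = max(0, self.state - 1)
        else:
            self.state = min(self.n_states - 1, self.state + 1)
        reward = 1.0 if self.state == self.n_states - 1 else 0.0
        return self.state, reward, False  # non-episodic chain


class QLearningAgent:
    """Epsilon-greedy Q-learning with the parameters reported in the paper:
    Q_0 = 0_{S x A}, epsilon = 0.1, gamma = 0.9, alpha_t = 0.9 for all t."""
    def __init__(self, n_states, n_actions, epsilon=0.1, gamma=0.9, alpha=0.9):
        self.Q = np.zeros((n_states, n_actions))
        self.epsilon, self.gamma, self.alpha = epsilon, gamma, alpha

    def act(self, state, rng):
        if rng.random() < self.epsilon:
            return int(rng.integers(self.Q.shape[1]))   # explore
        return int(np.argmax(self.Q[state]))            # exploit

    def update(self, s, a, r, s_next):
        td_target = r + self.gamma * np.max(self.Q[s_next])
        self.Q[s, a] += self.alpha * (td_target - self.Q[s, a])


def run_poisoned(env, agent, attacker, steps=1000, seed=0):
    """Reward-poisoning loop in the spirit of Algorithm 1: at each step the
    attacker observes (Q, s, a, s', r) and adds a perturbation delta to the
    reward before the agent performs its Q-update."""
    rng = np.random.default_rng(seed)
    s = env.reset()
    for _ in range(steps):
        a = agent.act(s, rng)
        s_next, r, done = env.step(a)
        delta = attacker(agent.Q, s, a, s_next, r)      # attack policy phi
        agent.update(s, a, r + delta, s_next)           # poisoned reward
        s = s_next if not done else env.reset()
    return agent.Q


if __name__ == "__main__":
    env = ChainMDP()
    agent = QLearningAgent(env.n_states, env.n_actions)
    # Example non-adaptive attacker: penalize the "right" action in every state.
    attacker = lambda Q, s, a, s_next, r: -2.0 if a == 1 else 0.0
    Q = run_poisoned(env, agent, attacker, steps=2000)
    print("Learned greedy policy:", np.argmax(Q, axis=1))
```

As a design note, the paper distinguishes non-adaptive attack policies, which depend only on the observed transition (s, a, s'), from adaptive policies that may also condition on the agent's current Q-table; the lambda above corresponds to the non-adaptive case, while an adaptive attacker would additionally inspect the `Q` argument passed to it.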