Vulnerability-Aware Poisoning Mechanism for Online RL with Unknown Dynamics

Authors: Yanchao Sun, Da Huo, Furong Huang

ICLR 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Experiments on multiple deep RL agents and multiple environments show that our poisoning algorithm successfully prevents agents from learning a good policy or teaches the agents to converge to a target policy, with a limited attacking budget." (Abstract) and "In this section, we evaluate the performance of VA2C-P by poisoning multiple algorithms on various environments. We demonstrate that VA2C-P can effectively reduce the total reward of a training agent, or force the agent to choose a specific policy with limited power and budget." (Section 5)
Researcher Affiliation | Academia | (1,3) Department of Computer Science, University of Maryland, College Park, MD 20742, USA; (2) Shanghai Jiao Tong University, China. Emails: ycs@umd.edu, sjtuhuoda@sjtu.edu.cn, furongh@umd.edu
Pseudocode | Yes | Algorithm 1: Vulnerability-Aware Adversarial Critic Poison (page 5) and Algorithm 3: Non-targeted White-box VA2C-P with Or-Poisoning (page 13)
Open Source Code | Yes | "In the supplementary materials we provide the code and instructions, as well as demo videos of poisoning A2C in the Hopper environment, where one can see that under the same budget constraints, random poisoning has nearly no influence on the agent's behaviors, while our proposed VA2C-P successfully prevents the agent from hopping forward." (Section G.1)
Open Datasets | Yes | "And we choose 5 Gym (Brockman et al., 2016) environments with increasing difficulty levels: Cart Pole, Lunar Lander, Hopper, Walker and Half Cheetah." (Section 5) A sketch of instantiating these environments follows the table.
Dataset Splits | No | No explicit train/validation/test splits are provided. The paper instead reports the number of training episodes and steps (e.g., "We run VPG and PPO for 1000 episodes on every environment" and "learning lasts for 80000 steps in total"), which is expected for online RL, where training data is generated dynamically through interaction with the environment.
Hardware Specification | No | No specific hardware details (e.g., CPU or GPU models, memory) are provided. The paper only mentions using "PyTorch implementations" and running "16 processes".
Software Dependencies | No | "We implement VPG and PPO with PyTorch, and the implementation of A2C and ACKTR are modified from the project by Kostrikov (2018)." (Section G.1) Specific version numbers for PyTorch or other libraries are not provided.
Experiment Setup | Yes | "Network Architecture. For all the learners, we use a two-layer policy network with Tanh as the activation function, where each layer has 64 nodes. PPO, A2C and ACKTR also have an additional same-sized critic network. Hyper-parameters. In all experiments, the discount factor γ is set to be 0.99. We run VPG and PPO for 1000 episodes on every environment, and update the policy after every episode. For A2C and ACKTR, we use 16 processes to collect observations simultaneously, and update the policy every 5 steps (i.e., each observation O has 80 (s, a, r) tuples); learning lasts for 80000 steps in total. All results are averaged over 10 random seeds." (Section G.1) A minimal code sketch of this setup follows the table.
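
The environments listed under Open Datasets can be created through the standard Gym API. This is a minimal sketch only: the paper does not state the environment version suffixes, so the IDs below (and the Box2D/MuJoCo dependencies they imply) are assumptions for illustration.

```python
# Illustrative only: environment IDs/versions are assumed, not taken from the paper.
import gym

ENV_IDS = [
    "CartPole-v1",      # Cart Pole
    "LunarLander-v2",   # Lunar Lander (requires Box2D)
    "Hopper-v2",        # Hopper (requires MuJoCo)
    "Walker2d-v2",      # Walker (requires MuJoCo)
    "HalfCheetah-v2",   # Half Cheetah (requires MuJoCo)
]

envs = {env_id: gym.make(env_id) for env_id in ENV_IDS}
for env_id, env in envs.items():
    print(env_id, env.observation_space, env.action_space)
```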
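
The architecture and hyper-parameters quoted under Experiment Setup translate directly into code. Below is a minimal PyTorch sketch, assuming standard torch.nn modules; the class name ActorCritic and all variable names are illustrative and not taken from the authors' released implementation.

```python
# Illustrative reconstruction of the setup quoted from Section G.1, not the authors' code.
import torch
import torch.nn as nn


class ActorCritic(nn.Module):
    """Two-layer, 64-unit Tanh policy network, plus a same-sized critic
    (the critic is only used by PPO, A2C and ACKTR)."""

    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 64):
        super().__init__()
        self.pi = nn.Sequential(              # policy network
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, act_dim),       # logits (discrete) or action mean (continuous)
        )
        self.v = nn.Sequential(               # same-sized critic network
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs: torch.Tensor):
        return self.pi(obs), self.v(obs)


# Hyper-parameters quoted in Section G.1.
GAMMA = 0.99              # discount factor
VPG_PPO_EPISODES = 1000   # VPG/PPO: 1000 episodes, policy updated after every episode
A2C_NUM_PROCESSES = 16    # A2C/ACKTR: 16 parallel processes collect observations
A2C_UPDATE_EVERY = 5      # update every 5 steps -> 16 * 5 = 80 (s, a, r) tuples per update
A2C_TOTAL_STEPS = 80_000  # total training steps for A2C/ACKTR
NUM_SEEDS = 10            # all results averaged over 10 random seeds
```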