Provably Efficient Black-Box Action Poisoning Attacks Against Reinforcement Learning
Authors: Guanlin Liu, Lifeng Lai
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we empirically evaluate the performance of LCB-H attacks against three efficient RL agents, namely UCB-H [Jin et al., 2018], UCB-B [Jin et al., 2018] and UCBVI-CH [Azar et al., 2017], respectively. We perform numerical simulations on an environment represented as an MDP with ten states and five actions, i.e. S = 10 and A = 5. |
| Researcher Affiliation | Academia | Guanlin Liu, Lifeng Lai Department of Electrical and Computer Engineering University of California, Davis {glnliu,lflai}@ucdavis.edu |
| Pseudocode | Yes | Algorithm 1: LCB-H attack strategy on RL algorithm |
| Open Source Code | No | The paper does not provide any statement or link indicating the public availability of the source code for the described methodology. |
| Open Datasets | No | We perform numerical simulations on an environment represented as an MDP with ten states and five actions, i.e. S = 10 and A = 5. The environment is a periodic 1-d grid world... By randomly generating p(s, a) with 0.5 < p(s, a) < 1, we randomly generate the transition probabilities P(s'|s, a) for all actions a and states s. The mean rewards of state-action pairs are randomly generated from the set {0.2, 0.35, 0.5, 0.65, 0.8}. A hedged reconstruction of this simulated environment appears after the table. |
| Dataset Splits | No | The paper describes simulation over episodes (K) and steps (H) but does not provide specific training, validation, or test dataset splits in the traditional sense of supervised learning. |
| Hardware Specification | Yes | Each of the individual experimental runs costs about twenty hours on one physical CPU core. The type of CPU is Intel Core i7-8700. |
| Software Dependencies | No | The paper mentions specific RL algorithms (UCB-H, UCB-B, UCBVI-CH) but does not provide specific software dependencies or version numbers for any libraries or programming languages used. |
| Experiment Setup | Yes | In this paper, we assume the rewards are bounded by [0, 1]. Thus, we use a Bernoulli distribution to randomize the reward signal. The target policy is randomly chosen by deleting the worst action, so as to satisfy Assumption 1. We set the total number of steps H = 10 and the total number of episodes K = 10^9. |
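The environment and setup excerpts above contain enough detail for a rough simulator. The sketch below is a minimal, hypothetical reconstruction in Python: the dimensions S = 10, A = 5, H = 10, the sampling range 0.5 < p(s, a) < 1, the reward-mean set {0.2, 0.35, 0.5, 0.65, 0.8}, and the Bernoulli reward realization come from the quoted text, while the exact way p(s, a) induces the periodic grid-world transition kernel is an assumption (the excerpt does not spell it out), as is the helper name `step`.

```python
import numpy as np

# Dimensions quoted in the paper's experiment section.
S, A, H = 10, 5, 10           # states, actions, steps per episode
rng = np.random.default_rng(0)

# Randomly generated p(s, a) with 0.5 < p(s, a) < 1, as in the excerpt.
p = rng.uniform(0.5, 1.0, size=(S, A))

# Mean rewards drawn from the quoted value set.
reward_means = rng.choice([0.2, 0.35, 0.5, 0.65, 0.8], size=(S, A))

# ASSUMPTION: the excerpt does not say exactly how p(s, a) induces P(s'|s, a)
# on the periodic 1-d grid world. Here, with probability p(s, a) the agent
# moves a steps to the right (modulo S) and otherwise stays where it is.
P = np.zeros((S, A, S))
for s in range(S):
    for a in range(A):
        P[s, a, (s + a) % S] += p[s, a]
        P[s, a, s] += 1.0 - p[s, a]

def step(s, a):
    """Sample the next state and a Bernoulli reward (rewards bounded in [0, 1])."""
    s_next = int(rng.choice(S, p=P[s, a]))
    r = int(rng.binomial(1, reward_means[s, a]))
    return s_next, r
```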
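Algorithm 1 (the LCB-H attack strategy) is referenced above only by its caption. To make the black-box action-poisoning setting concrete, the hypothetical wrapper below intercepts the agent's chosen action and executes a different one while the agent updates as if its own choice had been carried out. The override rule shown is a placeholder for illustration only, not the authors' LCB-H rule, which selects overrides using lower confidence bounds; the `select_action`/`update` method names are likewise assumptions rather than an interface from the paper.

```python
import numpy as np

def poisoned_episode(agent, env_step, target_policy, s0, rng, H=10, A=5):
    """Run one episode of a black-box action-poisoning attack (illustrative only).

    `agent` is assumed to expose select_action(s, h) and update(s, a, r, s_next, h);
    these method names are placeholders, not an API from the paper.
    """
    s = s0
    for h in range(H):
        a_chosen = agent.select_action(s, h)
        if a_chosen == target_policy[s]:
            # Leave the action untouched when the agent already follows the target.
            a_executed = a_chosen
        else:
            # Placeholder override: execute some other action so that deviating
            # from the target looks unattractive.  LCB-H instead chooses this
            # override from lower-confidence-bound estimates of action values.
            a_executed = int(rng.choice([a for a in range(A) if a != a_chosen]))
        s_next, r = env_step(s, a_executed)
        # Black-box aspect: the agent never observes the override and updates
        # as if its own chosen action had been executed.
        agent.update(s, a_chosen, r, s_next, h)
        s = s_next
```

With the `step` function from the previous sketch passed as `env_step`, the two pieces form a toy version of the attacked-learning loop the experiments describe.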