Provably Efficient Black-Box Action Poisoning Attacks Against Reinforcement Learning

Authors: Guanlin Liu, Lifeng Lai

NeurIPS 2021

| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | "In this section, we empirically evaluate the performance of LCB-H attacks against three efficient RL agents, namely UCB-H [Jin et al., 2018], UCB-B [Jin et al., 2018] and UCBVI-CH [Azar et al., 2017], respectively. We perform numerical simulations on an environment represented as an MDP with ten states and five actions, i.e. S = 10 and A = 5." |
| Researcher Affiliation | Academia | Guanlin Liu, Lifeng Lai; Department of Electrical and Computer Engineering, University of California, Davis; {glnliu,lflai}@ucdavis.edu |
| Pseudocode | Yes | "Algorithm 1: LCB-H attack strategy on RL algorithm" (a hedged sketch of such an attack loop is given after the table) |
| Open Source Code | No | The paper does not provide any statement or link indicating public availability of the source code for the described methodology. |
| Open Datasets | No | "We perform numerical simulations on an environment represented as an MDP with ten states and five actions, i.e. S = 10 and A = 5. The environment is a periodic 1-d grid world... By randomly generating p(s, a) with 0.5 < p(s, a) < 1, we randomly generate the transition probabilities P(s'\|s, a) for all action a and state s. The mean rewards of state-action pairs are randomly generated from a set of values {0.2, 0.35, 0.5, 0.65, 0.8}." (see the environment-generation sketch after the table) |
| Dataset Splits | No | The paper describes simulation over K episodes of H steps each, but does not provide training, validation, or test splits in the traditional supervised-learning sense. |
| Hardware Specification | Yes | "Each of the individual experimental runs costs about twenty hours on one physical CPU core. The type of CPU is Intel Core i7-8700." |
| Software Dependencies | No | The paper names the attacked RL algorithms (UCB-H, UCB-B, UCBVI-CH) but does not list software dependencies or version numbers for any libraries or programming languages used. |
| Experiment Setup | Yes | "In this paper, we assume the rewards are bounded by [0, 1]. Thus, we use Bernoulli distribution to randomize the reward signal. The target policy is randomly chosen by deleting the worst action, so as to satisfy Assumption 1. We set the total number of steps H = 10 and the total number of episodes K = 10^9." (see the configuration sketch after the table) |
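
The extracted text only names Algorithm 1, so the paper's actual procedure is not reproduced here. The snippet below is a minimal, hypothetical sketch of a black-box action-poisoning loop in the spirit the name LCB-H suggests: actions that already follow the target policy pass through unchanged, while any other action is swapped for the one with the smallest Hoeffding-style lower confidence bound on its observed reward. The override rule, the `env.reset`/`env.step` and `agent.act`/`agent.observe` interfaces, and all variable names are illustrative assumptions, not the paper's Algorithm 1.

```python
import numpy as np

# Hypothetical black-box action-poisoning loop (NOT the paper's Algorithm 1).
# The attacker sits between the agent and the environment: it sees the agent's
# intended action, may replace it, and forwards the executed action to the MDP,
# while the agent updates as if its own action had been taken.

def run_poisoned_training(env, agent, target_policy, S, A, H, K, c=1.0):
    counts = np.ones((H, S, A))       # attacker's visit counts (init to 1 to avoid /0)
    reward_sum = np.zeros((H, S, A))  # attacker's accumulated rewards

    for k in range(K):
        state = env.reset()
        for h in range(H):
            intended = agent.act(h, state)                # action the agent wants to take
            executed = intended
            if intended != target_policy[h, state]:
                mean = reward_sum[h, state] / counts[h, state]
                bonus = c * np.sqrt(np.log(k + 2) / counts[h, state])
                executed = int(np.argmin(mean - bonus))   # pessimistic "worst" action (LCB)
            next_state, reward = env.step(executed)       # environment sees the poisoned action
            agent.observe(h, state, intended, reward, next_state)  # agent is unaware of the swap
            counts[h, state, executed] += 1
            reward_sum[h, state, executed] += reward
            state = next_state
```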
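The Open Datasets row describes a randomly generated periodic 1-d grid world rather than a public dataset. Below is a small sketch of how such an environment could be instantiated from the quoted description. The mapping from p(s, a) to the transition kernel (action a attempting a wrap-around move of a - 2 positions that succeeds with probability p(s, a) and otherwise leaves the state unchanged) is an assumption, since the excerpt does not spell out that rule.

```python
import numpy as np

rng = np.random.default_rng(0)

S, A = 10, 5                               # ten states, five actions, as stated in the excerpt
REWARD_LEVELS = [0.2, 0.35, 0.5, 0.65, 0.8]

# p(s, a) drawn uniformly from (0.5, 1), as described in the excerpt.
p = rng.uniform(0.5, 1.0, size=(S, A))

# Assumed transition rule on the periodic 1-d grid: action a tries to move the
# agent by (a - 2) positions with wrap-around; the move succeeds with probability
# p(s, a) and otherwise the agent stays in place.
P = np.zeros((S, A, S))
for s in range(S):
    for a in range(A):
        target = (s + a - 2) % S
        P[s, a, target] += p[s, a]
        P[s, a, s] += 1.0 - p[s, a]

# Mean rewards drawn from the stated discrete set; realized rewards are Bernoulli,
# consistent with the [0, 1] reward bound quoted in the Experiment Setup row.
mean_reward = rng.choice(REWARD_LEVELS, size=(S, A))

def sample_reward(s, a):
    """Bernoulli reward with the randomly assigned mean for state-action (s, a)."""
    return float(rng.random() < mean_reward[s, a])
```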
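The Experiment Setup row fixes H = 10 and K = 10^9 and states that the target policy is chosen randomly after deleting the worst action. The configuration sketch below mirrors that description; treating "worst" as the action with the lowest mean immediate reward at each state is an assumption (the paper may rank actions by their Q-values instead), and the reward table here is just a stand-in for the randomly generated one above.

```python
import numpy as np

rng = np.random.default_rng(1)

H, K = 10, 10**9                  # horizon and number of episodes from the excerpt
S, A = 10, 5

# Stand-in mean-reward table; in a full run this would come from the environment
# generation sketch above.
mean_reward = rng.choice([0.2, 0.35, 0.5, 0.65, 0.8], size=(S, A))

# Hypothetical target-policy construction: at each (step, state) pair, drop the
# assumed "worst" action and pick the target action uniformly from the rest.
target_policy = np.zeros((H, S), dtype=int)
for h in range(H):
    for s in range(S):
        worst = int(np.argmin(mean_reward[s]))
        candidates = [a for a in range(A) if a != worst]
        target_policy[h, s] = rng.choice(candidates)

# target_policy would then be passed, together with an environment and an agent
# such as UCB-H, to a poisoned training loop run for K episodes of H steps each.
```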