Provably Efficient Black-Box Action Poisoning Attacks Against Reinforcement Learning

Authors: Guanlin Liu, Lifeng Lai

NeurIPS 2021

| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | "In this section, we empirically evaluate the performance of LCB-H attacks against three efficient RL agents, namely UCB-H [Jin et al., 2018], UCB-B [Jin et al., 2018] and UCBVI-CH [Azar et al., 2017], respectively. We perform numerical simulations on an environment represented as an MDP with ten states and five actions, i.e. S = 10 and A = 5." |
| Researcher Affiliation | Academia | Guanlin Liu, Lifeng Lai; Department of Electrical and Computer Engineering, University of California, Davis; {glnliu,lflai}@ucdavis.edu |
| Pseudocode | Yes | "Algorithm 1: LCB-H attack strategy on RL algorithm" (a hedged sketch of such an attack loop is given after the table) |
| Open Source Code | No | The paper does not provide any statement or link indicating public availability of the source code for the described methodology. |
| Open Datasets | No | "We perform numerical simulations on an environment represented as an MDP with ten states and five actions, i.e. S = 10 and A = 5. The environment is a periodic 1-d grid world... By randomly generating p(s, a) with 0.5 < p(s, a) < 1, we randomly generate the transition probabilities P(s'\|s, a) for all action a and state s. The mean rewards of state-action pairs are randomly generated from a set of values {0.2, 0.35, 0.5, 0.65, 0.8}." (see the environment-generation sketch after the table) |
| Dataset Splits | No | The paper describes simulation over K episodes of H steps each, but does not provide training, validation, or test splits in the traditional supervised-learning sense. |
| Hardware Specification | Yes | "Each of the individual experimental runs costs about twenty hours on one physical CPU core. The type of CPU is Intel Core i7-8700." |
| Software Dependencies | No | The paper names the attacked RL algorithms (UCB-H, UCB-B, UCBVI-CH) but does not list software dependencies or version numbers for any libraries or programming languages used. |
| Experiment Setup | Yes | "In this paper, we assume the rewards are bounded by [0, 1]. Thus, we use Bernoulli distribution to randomize the reward signal. The target policy is randomly chosen by deleting the worst action, so as to satisfy Assumption 1. We set the total number of steps H = 10 and the total number of episodes K = 10^9." (see the configuration sketch after the table) |
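
The extracted text only names Algorithm 1, so the paper's actual procedure is not reproduced here. The snippet below is a minimal, hypothetical sketch of a black-box action-poisoning loop in the spirit the name LCB-H suggests: actions that already follow the target policy pass through unchanged, while any other action is swapped for the one with the smallest Hoeffding-style lower confidence bound on its observed reward. The override rule, the `env.reset`/`env.step` and `agent.act`/`agent.observe` interfaces, and all variable names are illustrative assumptions, not the paper's Algorithm 1.

```python
import numpy as np

# Hypothetical black-box action-poisoning loop (NOT the paper's Algorithm 1).
# The attacker sits between the agent and the environment: it sees the agent's
# intended action, may replace it, and forwards the executed action to the MDP,
# while the agent updates as if its own action had been taken.

def run_poisoned_training(env, agent, target_policy, S, A, H, K, c=1.0):
    counts = np.ones((H, S, A))       # attacker's visit counts (init to 1 to avoid /0)
    reward_sum = np.zeros((H, S, A))  # attacker's accumulated rewards

    for k in range(K):
        state = env.reset()
        for h in range(H):
            intended = agent.act(h, state)                # action the agent wants to take
            executed = intended
            if intended != target_policy[h, state]:
                mean = reward_sum[h, state] / counts[h, state]
                bonus = c * np.sqrt(np.log(k + 2) / counts[h, state])
                executed = int(np.argmin(mean - bonus))   # pessimistic "worst" action (LCB)
            next_state, reward = env.step(executed)       # environment sees the poisoned action
            agent.observe(h, state, intended, reward, next_state)  # agent is unaware of the swap
            counts[h, state, executed] += 1
            reward_sum[h, state, executed] += reward
            state = next_state
```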
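The Open Datasets row describes a randomly generated periodic 1-d grid world rather than a public dataset. Below is a small sketch of how such an environment could be instantiated from the quoted description. The mapping from p(s, a) to the transition kernel (action a attempting a wrap-around move of a - 2 positions that succeeds with probability p(s, a) and otherwise leaves the state unchanged) is an assumption, since the excerpt does not spell out that rule.

```python
import numpy as np

rng = np.random.default_rng(0)

S, A = 10, 5                               # ten states, five actions, as stated in the excerpt
REWARD_LEVELS = [0.2, 0.35, 0.5, 0.65, 0.8]

# p(s, a) drawn uniformly from (0.5, 1), as described in the excerpt.
p = rng.uniform(0.5, 1.0, size=(S, A))

# Assumed transition rule on the periodic 1-d grid: action a tries to move the
# agent by (a - 2) positions with wrap-around; the move succeeds with probability
# p(s, a) and otherwise the agent stays in place.
P = np.zeros((S, A, S))
for s in range(S):
    for a in range(A):
        target = (s + a - 2) % S
        P[s, a, target] += p[s, a]
        P[s, a, s] += 1.0 - p[s, a]

# Mean rewards drawn from the stated discrete set; realized rewards are Bernoulli,
# consistent with the [0, 1] reward bound quoted in the Experiment Setup row.
mean_reward = rng.choice(REWARD_LEVELS, size=(S, A))

def sample_reward(s, a):
    """Bernoulli reward with the randomly assigned mean for state-action (s, a)."""
    return float(rng.random() < mean_reward[s, a])
```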
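The Experiment Setup row fixes H = 10 and K = 10^9 and states that the target policy is chosen randomly after deleting the worst action. The configuration sketch below mirrors that description; treating "worst" as the action with the lowest mean immediate reward at each state is an assumption (the paper may rank actions by their Q-values instead), and the reward table here is just a stand-in for the randomly generated one above.

```python
import numpy as np

rng = np.random.default_rng(1)

H, K = 10, 10**9                  # horizon and number of episodes from the excerpt
S, A = 10, 5

# Stand-in mean-reward table; in a full run this would come from the environment
# generation sketch above.
mean_reward = rng.choice([0.2, 0.35, 0.5, 0.65, 0.8], size=(S, A))

# Hypothetical target-policy construction: at each (step, state) pair, drop the
# assumed "worst" action and pick the target action uniformly from the rest.
target_policy = np.zeros((H, S), dtype=int)
for h in range(H):
    for s in range(S):
        worst = int(np.argmin(mean_reward[s]))
        candidates = [a for a in range(A) if a != worst]
        target_policy[h, s] = rng.choice(candidates)

# target_policy would then be passed, together with an environment and an agent
# such as UCB-H, to a poisoned training loop run for K episodes of H steps each.
```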