Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Efficient Reward Poisoning Attacks on Online Deep Reinforcement Learning
Authors: Yinglun Xu, Qi Zeng, Gagandeep Singh
TMLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We provide a theoretical analysis of the efficiency of our attack and perform an extensive empirical evaluation. Our results show that our attacks efficiently poison agents learning in several popular classical control and MuJoCo environments with a variety of state-of-the-art DRL algorithms, such as DQN, PPO, SAC, etc. |
| Researcher Affiliation | Academia | Yinglun Xu (EMAIL), Department of Computer Science, University of Illinois at Urbana-Champaign; Qi Zeng (EMAIL), Department of Computer Science, University of Illinois at Urbana-Champaign; Gagandeep Singh (EMAIL), Department of Computer Science, University of Illinois at Urbana-Champaign |
| Pseudocode | No | The paper describes methods and algorithms using mathematical formulations and descriptive text, but it does not contain any explicitly labeled pseudocode blocks or algorithm figures. |
| Open Source Code | No | The implementation of the algorithms is based on the spinningup project (Achiam, 2018). The paper refers to third-party code used for implementation but does not provide an explicit link or statement for the authors' own source code. |
| Open Datasets | Yes | We consider 4 common Gym environments (Brockman et al., 2016) in the discrete case: CartPole, LunarLander, MountainCar, and Acrobot, and 4 continuous cases: HalfCheetah, Hopper, Walker2d, and Swimmer. |
| Dataset Splits | No | The paper discusses training steps (T) and the proportion of corrupted steps (C/T) within the context of DRL environments. However, it does not describe traditional dataset splits (e.g., train/validation/test percentages or counts) for pre-collected data, as DRL agents typically learn through interaction with an environment rather than static datasets. |
| Hardware Specification | No | The paper does not provide any specific details about the hardware (e.g., GPU/CPU models, memory, or cloud instances) used for running the experiments. |
| Software Dependencies | No | The implementation of the algorithms is based on the spinningup project (Achiam, 2018). While a project is mentioned, specific version numbers for software libraries, frameworks, or operating systems are not provided. |
| Experiment Setup | Yes | Determining B. Let Vmax, Vmin be the highest and lowest expected reward a policy can get in an episode. We note that Vmax − Vmin represents the maximum environment-specific net reward an agent can get during an episode, and (Vmax − Vmin)/Lmax represents the range of average reward at a time step for an agent. We set B to be higher than (Vmax − Vmin)/Lmax, as a low value of B may not influence the learning process for any attack. A similar choice has also been considered in simpler tabular MDP settings (Zhang et al., 2020b). Determining E. We restrict the value of E to be less than Vmax − Vmin to ensure that the perturbation in each episode is less than the net reward. Determining C. We want the attack to corrupt as few training steps as possible, so we set the value of C/T ≪ 1. Since this is usually the most important budget, we test the efficiency of the attack under different values of C/T ∈ [0.005, 0.2]. Determining attack parameters. For the choice of r in the continuous action spaces, we choose r from a moderate range r ∈ [0.3, 0.75] for each learning scenario. In Appendix A, we study the influence of r on the attack and show that the attack can work well within a wide range of values for r. For all attacks, we have \|Δ\| = B by construction. The sign of Δ for the AE and AI attacks is given by Definitions 5.3 and 5.6. For the UR attack, we test both cases, Δ = B and Δ = −B, and show the best result. For the choice of π, according to Definition 5.6, we randomly generate π for the AI attack. For the AE attack, since the attack works in the black-box setting, π is also randomly generated. We study the influence of the choice of π for the AE attack in Appendix A. Measuring V. To evaluate the value of V, we test the empirical performance of the learned policy after each epoch and report the highest performance as the value of V. We repeat each experiment 10 times and report the average value; the variance of results is reported in the appendix. (Table 1: Parameters for experiments) |
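The budget rules quoted in the Experiment Setup row (per-step budget B above the per-step reward range, per-episode budget E below the net reward, and corruption budget C as a small fraction of training steps) can be sketched as follows. This is a minimal illustration, not the authors' code: the function name, the specific multipliers (2.0 and 0.5), and the example numbers are all assumptions chosen only to satisfy the stated inequalities.

```python
def attack_budgets(v_max, v_min, l_max, corruption_fraction, total_steps):
    """Illustrative budget selection following the rules quoted above.

    All multipliers are hypothetical; the paper only constrains
    B > (Vmax - Vmin)/Lmax, E < Vmax - Vmin, and C/T << 1.
    """
    # Range of the average per-step reward over an episode of length Lmax.
    per_step_range = (v_max - v_min) / l_max
    # Per-step budget B: set strictly above the per-step reward range.
    B = 2.0 * per_step_range
    # Per-episode budget E: set strictly below the net episode reward.
    E = 0.5 * (v_max - v_min)
    # Corruption budget C: a small fraction of the T training steps.
    C = int(corruption_fraction * total_steps)
    return B, E, C


# Example with made-up CartPole-like numbers (not values from the paper):
B, E, C = attack_budgets(v_max=500.0, v_min=10.0, l_max=500,
                         corruption_fraction=0.01, total_steps=100_000)
```

With these placeholder inputs, the sketch yields B = 1.96, E = 245.0, and C = 1000, i.e. 1% of training steps may be corrupted, matching the low end of the C/T ∈ [0.005, 0.2] range the authors test.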