Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Efficient Reward Poisoning Attacks on Online Deep Reinforcement Learning
Authors: Yinglun Xu, Qi Zeng, Gagandeep Singh
TMLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We provide a theoretical analysis of the efficiency of our attack and perform an extensive empirical evaluation. Our results show that our attacks efficiently poison agents learning in several popular classical control and MuJoCo environments with a variety of state-of-the-art DRL algorithms, such as DQN, PPO, SAC, etc. |
| Researcher Affiliation | Academia | Yinglun Xu (EMAIL), Department of Computer Science, University of Illinois at Urbana-Champaign; Qi Zeng (EMAIL), Department of Computer Science, University of Illinois at Urbana-Champaign; Gagandeep Singh (EMAIL), Department of Computer Science, University of Illinois at Urbana-Champaign |
| Pseudocode | No | The paper describes methods and algorithms using mathematical formulations and descriptive text, but it does not contain any explicitly labeled pseudocode blocks or algorithm figures. |
| Open Source Code | No | The implementation of the algorithms is based on the spinningup project (Achiam, 2018). The paper refers to third-party code used for implementation but does not provide an explicit link or statement for the authors' own source code. |
| Open Datasets | Yes | We consider 4 common Gym environments (Brockman et al., 2016) in the discrete case: CartPole, LunarLander, MountainCar, and Acrobot, and 4 continuous cases: HalfCheetah, Hopper, Walker2d, and Swimmer. |
| Dataset Splits | No | The paper discusses training steps (T) and the proportion of corrupted steps (C/T) within the context of DRL environments. However, it does not describe traditional dataset splits (e.g., train/validation/test percentages or counts) for pre-collected data, as DRL agents typically learn through interaction with an environment rather than static datasets. |
| Hardware Specification | No | The paper does not provide any specific details about the hardware (e.g., GPU/CPU models, memory, or cloud instances) used for running the experiments. |
| Software Dependencies | No | The implementation of the algorithms is based on the spinningup project (Achiam, 2018). While a project is mentioned, specific version numbers for software libraries, frameworks, or operating systems are not provided. |
| Experiment Setup | Yes | Determining B. Let Vmax, Vmin be the highest and lowest expected reward a policy can get in an episode. We note that Vmax − Vmin represents the maximum environment-specific net reward an agent can get during an episode, and (Vmax − Vmin)/Lmax represents the range of average reward at a time step for an agent. We set B to be higher than (Vmax − Vmin)/Lmax, as a low value of B may not influence the learning process for any attack. A similar choice has also been considered in simpler tabular MDP settings (Zhang et al., 2020b). Determining E. We restrict the value of E to be less than Vmax − Vmin to ensure that the perturbation in each episode is less than the net reward. Determining C. We want the attack to corrupt as few training steps as possible, so we set the value of C/T ≪ 1. Since this is usually the most important budget, we test the efficiency of the attack under different values of C/T ∈ [0.005, 0.2]. Determining attack parameters. For the choice of r in the continuous action spaces, we choose r from a moderate range r ∈ [0.3, 0.75] for each learning scenario. In Appendix A, we study the influence of r on the attack and show that the attack can work well within a wide range of values for r. For all attacks, we have \|Δ\| = B by construction. The sign of Δ for the AE and AI attacks is given by Definitions 5.3 and 5.6. For the UR attack, we test both cases, Δ = B and Δ = −B, and show the best result. For the choice of π, according to Definition 5.6, we randomly generate π for the AI attack. For the AE attack, since the attack works in the black-box setting, π is also randomly generated. We study the influence of the choice of π for the AE attack in Appendix A. Measuring V. To evaluate the value of V, we test the empirical performance of the learned policy after each epoch and report the highest performance as the value of V. We repeat each experiment 10 times and report the average value; the variance of results is reported in the appendix. (Table 1: Parameters for experiments) |
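The budget rules quoted in the Experiment Setup row (per-step budget B above the per-step reward range, per-episode budget E below the net reward, and corruption budget C as a small fraction of training steps) can be sketched as follows. This is a minimal illustration, not the authors' code: the function name, the specific multipliers (2.0 and 0.5), and the example numbers are all assumptions chosen only to satisfy the stated inequalities.

```python
def attack_budgets(v_max, v_min, l_max, corruption_fraction, total_steps):
    """Illustrative budget selection following the rules quoted above.

    All multipliers are hypothetical; the paper only constrains
    B > (Vmax - Vmin)/Lmax, E < Vmax - Vmin, and C/T << 1.
    """
    # Range of the average per-step reward over an episode of length Lmax.
    per_step_range = (v_max - v_min) / l_max
    # Per-step budget B: set strictly above the per-step reward range.
    B = 2.0 * per_step_range
    # Per-episode budget E: set strictly below the net episode reward.
    E = 0.5 * (v_max - v_min)
    # Corruption budget C: a small fraction of the T training steps.
    C = int(corruption_fraction * total_steps)
    return B, E, C


# Example with made-up CartPole-like numbers (not values from the paper):
B, E, C = attack_budgets(v_max=500.0, v_min=10.0, l_max=500,
                         corruption_fraction=0.01, total_steps=100_000)
```

With these placeholder inputs, the sketch yields B = 1.96, E = 245.0, and C = 1000, i.e. 1% of training steps may be corrupted, matching the low end of the C/T ∈ [0.005, 0.2] range the authors test.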