Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Defense Against Reward Poisoning Attacks in Reinforcement Learning

Authors: Kiarash Banihashem, Adish Singla, Goran Radanovic

TMLR 2023 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Using simulation-based experiments, we demonstrate the effectiveness and robustness of our defense approach.
Researcher Affiliation	Academia	Kiarash Banihashem EMAIL University of Maryland Adish Singla EMAIL Max Planck Institute for Software Systems Goran Radanovic EMAIL Max Planck Institute for Software Systems
Pseudocode	No	The paper describes methods and algorithms in paragraph text and mathematical formulations but does not include any clearly labeled pseudocode or algorithm blocks.
Open Source Code	Yes	The source code for our experiments, as well as instructions for replicating our results can be found in the Supplementary Material.
Open Datasets	Yes	Navigation environment. Our first environment, shown in Figure 4a is the Navigation environment taken from Rakhsha et al. (2021). Grid world environment. For our second environment, shown in Figure 4b, we use the grid world environment from Ma et al. (2019) with slight modifications in order to ensure ergodicity
Dataset Splits	No	The paper describes two environments (Navigation and Grid world) with their parameters (states, actions, rewards, transition probabilities, initial state). However, it does not specify any training/test/validation splits for a static dataset, as the experiments are simulation-based reinforcement learning where agents interact with the environment.
Hardware Specification	Yes	The machine used for obtaining these results is a Macbook Pro personal computer with 4 Gigabytes of memory and a 2.4 GHz Intel Core i5 processor.
Software Dependencies	No	Given the results in Section 5 (Theorems 5.2 and 5.3), we use the linear programming formulation (P3) together with the CVXPY solver (Diamond & Boyd, 2016; Agrawal et al., 2018) for calculating the solution to the defense optimization problem (P2b). In the experiments, due to limited numerical precision, $\Theta^{\epsilon}$ is calculated with a tolerance parameter, set to $10^{-4}$ by default. In other words, $\Theta^{\epsilon} = \{(s, a) : |\hat{\rho}^{\pi} - \hat{\rho}^{\pi\{s;a\}} - \epsilon| \le 10^{-4}\}$. The paper mentions using CVXPY and implies Python, but it does not specify version numbers for any software, libraries, or solvers used.
Experiment Setup	Yes	The results are obtained with parameters ϵ = 0.1, ϵD = 0.2 and γ = 0.99 (see Section 3). In the experiments, due to limited numerical precision, $\Theta^{\epsilon}$ is calculated with a tolerance parameter, set to $10^{-4}$ by default. Navigation environment... The initial state is s0 and the discounting factor γ equals 0.99. Grid world environment... The initial state is S and the discounting factor γ equals 0.9.
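
The tolerance-based set described in the rows above can be illustrated with a minimal sketch. It is not the authors' code: the function name `tight_pairs`, its inputs, and the sample scores are all hypothetical, and the sketch assumes that the policy scores have already been computed elsewhere. It only shows how a state-action pair is kept when its score gap matches ϵ up to the stated tolerance.

```python
def tight_pairs(rho_ref, rho_neighbors, eps, tol=1e-4):
    """Return the (s, a) pairs whose score gap equals eps up to tol.

    rho_ref: score of the reference policy (hypothetical input).
    rho_neighbors: dict mapping (s, a) to the score of the policy that
        deviates from the reference policy at (s, a) (hypothetical input).
    eps: target gap; tol: numerical tolerance (1e-4, as quoted above).
    """
    return {
        sa
        for sa, rho in rho_neighbors.items()
        if abs(rho_ref - rho - eps) <= tol
    }


# Made-up scores for illustration only; they are not results from the paper.
scores = {
    ("s0", "left"): 0.90000003,  # gap is 0.1 up to rounding noise: kept
    ("s0", "right"): 0.85,       # gap is 0.15: dropped
    ("s1", "left"): 0.9005,      # gap is 0.0995, off by 5e-4: dropped
    ("s1", "right"): 0.89999,    # gap is 0.10001, off by 1e-5: kept
}
selected = tight_pairs(1.0, scores, eps=0.1)
print(sorted(selected))
```

An absolute tolerance is used here because the quoted definition compares the gap with ϵ directly; a relative tolerance would be a different design choice.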