Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Defense Against Reward Poisoning Attacks in Reinforcement Learning

Authors: Kiarash Banihashem, Adish Singla, Goran Radanovic

TMLR 2023

Reproducibility Variable Result LLM Response
Research Type Experimental Using simulation-based experiments, we demonstrate the effectiveness and robustness of our defense approach.
Researcher Affiliation Academia Kiarash Banihashem EMAIL University of Maryland Adish Singla EMAIL Max Planck Institute for Software Systems Goran Radanovic EMAIL Max Planck Institute for Software Systems
Pseudocode No The paper describes methods and algorithms in paragraph text and mathematical formulations but does not include any clearly labeled pseudocode or algorithm blocks.
Open Source Code Yes The source code for our experiments, as well as instructions for replicating our results can be found in the Supplementary Material.
Open Datasets Yes Navigation environment. Our first environment, shown in Figure 4a, is the Navigation environment taken from Rakhsha et al. (2021). Grid world environment. For our second environment, shown in Figure 4b, we use the grid world environment from Ma et al. (2019) with slight modifications in order to ensure ergodicity.
Dataset Splits No The paper describes two environments (Navigation and Grid world) with their parameters (states, actions, rewards, transition probabilities, initial state). However, it does not specify any training/test/validation splits for a static dataset, as the experiments are simulation-based reinforcement learning where agents interact with the environment.
Hardware Specification Yes The machine used for obtaining these results is a Macbook Pro personal computer with 4 Gigabytes of memory and a 2.4 GHz Intel Core i5 processor.
Software Dependencies No Given the results in Section 5 (Theorems 5.2 and 5.3), we use the linear programming formulation (P3) together with the CVXPY solver (Diamond & Boyd, 2016; Agrawal et al., 2018) for calculating the solution to the defense optimization problem (P2b). In the experiments, due to limited numerical precision, $\Theta^{\epsilon}$ is calculated with a tolerance parameter, set to $10^{-4}$ by default. In other words, $\Theta^{\epsilon} = \{(s, a) : |\widehat{\rho}^{\pi} - \widehat{\rho}^{\pi\{s;a\}} - \epsilon| \le 10^{-4}\}$. The paper mentions using CVXPY and implies Python, but it does not specify version numbers for any software, libraries, or solvers used.
Experiment Setup Yes The results are obtained with parameters $\epsilon = 0.1$, $\epsilon_D = 0.2$ and $\gamma = 0.99$ (see Section 3). In the experiments, due to limited numerical precision, $\Theta^{\epsilon}$ is calculated with a tolerance parameter, set to $10^{-4}$ by default. Navigation environment... The initial state is s0 and the discounting factor $\gamma$ equals 0.99. Grid world environment... The initial state is S and the discounting factor $\gamma$ equals 0.9.
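The tolerance-based membership test quoted in the rows above can be illustrated with a minimal Python sketch. It is not the authors' code. It assumes that a pair (s, a) belongs to the set when the gap between the score of the policy and the score of its neighbor policy differs from ϵ by at most the tolerance. The function name `tight_pairs`, the dictionary layout, and the numeric scores are all hypothetical and chosen only for illustration.

```python
from typing import Dict, Set, Tuple

# Hypothetical representation: a (state, action) pair identified by two labels.
StateAction = Tuple[str, str]


def tight_pairs(
    rho_policy: float,
    rho_neighbors: Dict[StateAction, float],
    epsilon: float,
    tol: float = 1e-4,
) -> Set[StateAction]:
    """Return the pairs whose score gap equals epsilon up to `tol`.

    Assumption (not taken from the authors' code): membership is decided by
    |rho_policy - rho_neighbor - epsilon| <= tol.
    """
    return {
        pair
        for pair, rho_neighbor in rho_neighbors.items()
        if abs(rho_policy - rho_neighbor - epsilon) <= tol
    }


# Illustrative, made-up scores; epsilon = 0.1 matches the reported setup.
rho_policy = 1.0
rho_neighbors = {
    ("s0", "left"): 0.9,       # gap is 0.1 up to floating-point rounding
    ("s0", "right"): 0.90005,  # gap is within 1e-4 of 0.1
    ("s1", "left"): 0.7,       # gap is 0.3, clearly not tight
}

theta = tight_pairs(rho_policy, rho_neighbors, epsilon=0.1)
theta_exact = tight_pairs(rho_policy, rho_neighbors, epsilon=0.1, tol=0.0)
```

With these made-up numbers, `theta` contains the two `s0` pairs, while `theta_exact` is empty: even the pair whose gap is nominally 0.1 fails an exact comparison because of floating-point rounding, which is the kind of numerical-precision issue a tolerance parameter guards against.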