Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Defense Against Reward Poisoning Attacks in Reinforcement Learning
Authors: Kiarash Banihashem, Adish Singla, Goran Radanovic
TMLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Using simulation-based experiments, we demonstrate the effectiveness and robustness of our defense approach. |
| Researcher Affiliation | Academia | Kiarash Banihashem EMAIL University of Maryland Adish Singla EMAIL Max Planck Institute for Software Systems Goran Radanovic EMAIL Max Planck Institute for Software Systems |
| Pseudocode | No | The paper describes methods and algorithms in paragraph text and mathematical formulations but does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | The source code for our experiments, as well as instructions for replicating our results can be found in the Supplementary Material. |
| Open Datasets | Yes | Navigation environment. Our first environment, shown in Figure 4a, is the Navigation environment taken from Rakhsha et al. (2021). Grid world environment. For our second environment, shown in Figure 4b, we use the grid world environment from Ma et al. (2019) with slight modifications in order to ensure ergodicity. |
| Dataset Splits | No | The paper describes two environments (Navigation and Grid world) with their parameters (states, actions, rewards, transition probabilities, initial state). However, it does not specify any training/test/validation splits for a static dataset, as the experiments are simulation-based reinforcement learning where agents interact with the environment. |
| Hardware Specification | Yes | The machine used for obtaining these results is a Macbook Pro personal computer with 4 Gigabytes of memory and a 2.4 GHz Intel Core i5 processor. |
| Software Dependencies | No | Given the results in Section 5 (Theorems 5.2 and 5.3), we use the linear programming formulation (P3) together with the CVXPY solver (Diamond & Boyd, 2016; Agrawal et al., 2018) for calculating the solution to the defense optimization problem (P2b). In the experiments, due to limited numerical precision, Θϵ is calculated with a tolerance parameter, set to 10⁻⁴ by default. In other words, Θϵ = {(s, a) : \|ρ̂^π − ρ̂^{π_{s;a}} − ϵ\| ≤ 10⁻⁴}. The paper mentions using CVXPY and implies Python, but it does not specify version numbers for any software, libraries, or solvers used. |
| Experiment Setup | Yes | The results are obtained with parameters ϵ = 0.1, ϵ_D = 0.2 and γ = 0.99 (see Section 3). In the experiments, due to limited numerical precision, Θϵ is calculated with a tolerance parameter, set to 10⁻⁴ by default. Navigation environment... The initial state is s0 and the discounting factor γ equals 0.99. Grid world environment... The initial state is S and the discounting factor γ equals 0.9. |
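The tolerance-based construction of Θϵ quoted in the table can be sketched in plain Python. This is a hypothetical illustration only: the variable names, score values, and the dictionary of neighbour-policy estimates are invented for the example and are not taken from the paper's supplementary code; only the tolerance (10⁻⁴) and ϵ = 0.1 come from the quoted excerpts.

```python
# Hypothetical sketch of computing the set Theta_eps from the table above:
# keep the (s, a) pairs whose estimated score gap is within a small
# numerical tolerance of the attack parameter epsilon.
TOL = 1e-4  # tolerance parameter, 10^-4 by default per the paper
EPS = 0.1   # epsilon from the reported experiment setup

# rho_hat: estimated score of the target policy pi.
# rho_hat_sa: estimated scores of neighbour policies pi_{s;a},
# keyed by (state, action) -- dummy values for illustration.
rho_hat = 0.85
rho_hat_sa = {
    ("s0", "up"): 0.75005,   # gap 0.09995, within TOL of EPS
    ("s0", "down"): 0.60,    # gap 0.25, far from EPS
    ("s1", "up"): 0.75,      # gap 0.1, exactly at EPS
}

# Theta_eps = {(s, a) : |rho_hat - rho_hat_sa[(s, a)] - EPS| <= TOL}
theta_eps = {
    (s, a)
    for (s, a), v in rho_hat_sa.items()
    if abs(rho_hat - v - EPS) <= TOL
}
```

With these dummy values, `("s0", "up")` and `("s1", "up")` land inside the tolerance band while `("s0", "down")` does not. The paper's actual pipeline feeds this set into the LP formulation (P3) solved with CVXPY, which is not reproduced here.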