Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Policy Teaching in Reinforcement Learning via Environment Poisoning Attacks

Authors: Amin Rakhsha, Goran Radanovic, Rati Devidze, Xiaojin Zhu, Adish Singla

JMLR 2021 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "In this section, we perform numerical simulations and empirically investigate the effectiveness of the proposed attacks on two different environments. For the reproducibility of experimental results and facilitating research in this area, the source code of our implementation is publicly available."
Researcher Affiliation | Academia | Amin Rakhsha, Max Planck Institute for Software Systems (MPI-SWS), Saarbrücken, 66123, Germany; Goran Radanovic, MPI-SWS, Saarbrücken, 66123, Germany; Rati Devidze, MPI-SWS, Saarbrücken, 66123, Germany; Xiaojin Zhu, University of Wisconsin-Madison, Madison, WI 53706, USA; Adish Singla, MPI-SWS, Saarbrücken, 66123, Germany
Pseudocode | No | The paper describes algorithms and problem formulations using mathematical notation and prose, but it does not contain any explicitly labeled "Pseudocode" or "Algorithm" blocks or figures.
Open Source Code | Yes | "For the reproducibility of experimental results and facilitating research in this area, the source code of our implementation is publicly available." (Footnote 10: https://github.com/adishs/jmlr2021_rl-policy-teaching_code)
Open Datasets | No | "The first environment we consider is a chain environment represented as an MDP with four states and two actions, see Figure 2. [...] The second environment we consider is a navigation environment represented as an MDP with nine states and two actions per state, see Figure 3." These are custom-defined environments described within the paper, not external publicly available datasets.
Dataset Splits | No | The paper uses custom-defined Markov Decision Process (MDP) environments for numerical simulations rather than fixed datasets, so the concept of training/validation/test splits does not apply and is not mentioned in the paper.
Hardware Specification | No | The paper does not specify the hardware (e.g., CPU or GPU models, memory) used to run the experiments. Run times are reported in Table 1, but no hardware specifications are given.
Software Dependencies | No | The paper mentions specific algorithms such as UCRL and Q-learning, but it does not list the software libraries (e.g., PyTorch, TensorFlow, scikit-learn) or version numbers used for the implementation.
Experiment Setup | Yes | "Experimental setup and parameter choices. For all the experiments, we set Cr = 3, Cp = 1, and use the ℓ-norm in the measure of the attack cost (see Section 3.1). The regularity parameter δ in the problems (P1) and (P2) is set to 0.0001. In the experiments, we vary R(s0, ·) ∈ [−5, 5] and vary the margin ϵ ∈ [0, 1] for the policy π. [...] For all the experiments, we set Cr = 3, Cp = 1. The regularity parameter δ in the problems (P1) and (P2) is set to 0.0001. In the experiments, we fix R(s0, ·) = 2.5 and the margin ϵ = 0.1 for the policy π. [...] For the average reward criteria, we consider an RL agent implementing the UCRL learning algorithm (Auer and Ortner, 2007). For the discounted reward criteria, we consider an RL agent implementing Q-learning with an exploration parameter set to 0.001 (Even-Dar and Mansour, 2003)."
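The Q-learning agent quoted in the Experiment Setup row can be illustrated with a minimal tabular sketch. This is not the authors' implementation (that code is in the linked repository); the 4-state chain below is a hypothetical stand-in for the paper's chain environment, and `chain_step`, the learning rate, discount factor, and episode counts are all illustrative assumptions. Only the exploration parameter of 0.001 is taken from the quoted setup.

```python
import random

def epsilon_greedy(q_row, epsilon, rng):
    # With probability epsilon pick a uniform random action,
    # otherwise pick a greedy action (ties broken at random).
    if rng.random() < epsilon:
        return rng.randrange(len(q_row))
    best = max(q_row)
    return rng.choice([a for a, v in enumerate(q_row) if v == best])

def q_learning(n_states, n_actions, step, gamma=0.9, alpha=0.1,
               epsilon=0.001, episodes=200, horizon=50, seed=0):
    # Tabular Q-learning; `step(s, a) -> (next_state, reward)` is a
    # caller-supplied transition function (illustrative harness only).
    rng = random.Random(seed)
    q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(episodes):
        s = 0
        for _ in range(horizon):
            a = epsilon_greedy(q[s], epsilon, rng)
            s2, r = step(s, a)
            q[s][a] += alpha * (r + gamma * max(q[s2]) - q[s][a])
            s = s2
    return q

# Toy 4-state chain: action 1 moves right, action 0 moves left;
# reward 1 whenever the agent lands in (or stays at) the last state.
def chain_step(s, a):
    s2 = min(s + 1, 3) if a == 1 else max(s - 1, 0)
    return s2, (1.0 if s2 == 3 else 0.0)

q = q_learning(n_states=4, n_actions=2, step=chain_step)
greedy_policy = [max(range(2), key=lambda a: q[s][a]) for s in range(4)]
print(greedy_policy)  # expect the agent to learn to move right in every state
```

Random tie-breaking in `epsilon_greedy` matters here: with ϵ as small as 0.001, a deterministic tie-break on the all-zero initial table would leave the agent stuck, whereas random tie-breaking lets the untrained agent random-walk to the rewarding state.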