Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Policy Teaching in Reinforcement Learning via Environment Poisoning Attacks

Authors: Amin Rakhsha, Goran Radanovic, Rati Devidze, Xiaojin Zhu, Adish Singla

JMLR 2021 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "In this section, we perform numerical simulations and empirically investigate the effectiveness of the proposed attacks on two different environments. For the reproducibility of experimental results and facilitating research in this area, the source code of our implementation is publicly available."
Researcher Affiliation | Academia | Amin Rakhsha, Max Planck Institute for Software Systems (MPI-SWS), Saarbrücken, 66123, Germany; Goran Radanovic, MPI-SWS, Saarbrücken, 66123, Germany; Rati Devidze, MPI-SWS, Saarbrücken, 66123, Germany; Xiaojin Zhu, University of Wisconsin-Madison, Madison, WI 53706, USA; Adish Singla, MPI-SWS, Saarbrücken, 66123, Germany
Pseudocode | No | The paper describes algorithms and problem formulations using mathematical notation and prose, but it does not contain any explicitly labeled "Pseudocode" or "Algorithm" blocks or figures.
Open Source Code | Yes | "For the reproducibility of experimental results and facilitating research in this area, the source code of our implementation is publicly available." (Footnote 10: https://github.com/adishs/jmlr2021_rl-policy-teaching_code)
Open Datasets | No | "The first environment we consider is a chain environment represented as an MDP with four states and two actions, see Figure 2. [...] The second environment we consider is a navigation environment represented as an MDP with nine states and two actions per state, see Figure 3." These are custom-defined environments described within the paper, not external publicly available datasets.
Dataset Splits | No | The paper uses custom-defined Markov Decision Process (MDP) environments for numerical simulations rather than fixed datasets, so the concept of training/validation/test splits does not apply and is not mentioned in the paper.
Hardware Specification | No | The paper does not specify the hardware (e.g., CPU or GPU models, memory) used to run the experiments. Run times are reported in Table 1, but no hardware specifications are given.
Software Dependencies | No | The paper mentions specific algorithms such as UCRL and Q-learning, but it does not list the software libraries (e.g., PyTorch, TensorFlow, scikit-learn) or version numbers used for the implementation.
Experiment Setup | Yes | "Experimental setup and parameter choices. For all the experiments, we set Cr = 3, Cp = 1, and use the ℓ-norm in the measure of the attack cost (see Section 3.1). The regularity parameter δ in the problems (P1) and (P2) is set to 0.0001. In the experiments, we vary R(s0, ·) ∈ [−5, 5] and vary the margin ϵ ∈ [0, 1] for the policy π. [...] For all the experiments, we set Cr = 3, Cp = 1. The regularity parameter δ in the problems (P1) and (P2) is set to 0.0001. In the experiments, we fix R(s0, ·) = 2.5 and the margin ϵ = 0.1 for the policy π. [...] For the average reward criteria, we consider an RL agent implementing the UCRL learning algorithm (Auer and Ortner, 2007). For the discounted reward criteria, we consider an RL agent implementing Q-learning with an exploration parameter set to 0.001 (Even-Dar and Mansour, 2003)."
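The Q-learning agent quoted in the Experiment Setup row can be illustrated with a minimal tabular sketch. This is not the authors' implementation (that code is in the linked repository); the 4-state chain below is a hypothetical stand-in for the paper's chain environment, and `chain_step`, the learning rate, discount factor, and episode counts are all illustrative assumptions. Only the exploration parameter of 0.001 is taken from the quoted setup.

```python
import random

def epsilon_greedy(q_row, epsilon, rng):
    # With probability epsilon pick a uniform random action,
    # otherwise pick a greedy action (ties broken at random).
    if rng.random() < epsilon:
        return rng.randrange(len(q_row))
    best = max(q_row)
    return rng.choice([a for a, v in enumerate(q_row) if v == best])

def q_learning(n_states, n_actions, step, gamma=0.9, alpha=0.1,
               epsilon=0.001, episodes=200, horizon=50, seed=0):
    # Tabular Q-learning; `step(s, a) -> (next_state, reward)` is a
    # caller-supplied transition function (illustrative harness only).
    rng = random.Random(seed)
    q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(episodes):
        s = 0
        for _ in range(horizon):
            a = epsilon_greedy(q[s], epsilon, rng)
            s2, r = step(s, a)
            q[s][a] += alpha * (r + gamma * max(q[s2]) - q[s][a])
            s = s2
    return q

# Toy 4-state chain: action 1 moves right, action 0 moves left;
# reward 1 whenever the agent lands in (or stays at) the last state.
def chain_step(s, a):
    s2 = min(s + 1, 3) if a == 1 else max(s - 1, 0)
    return s2, (1.0 if s2 == 3 else 0.0)

q = q_learning(n_states=4, n_actions=2, step=chain_step)
greedy_policy = [max(range(2), key=lambda a: q[s][a]) for s in range(4)]
print(greedy_policy)  # expect the agent to learn to move right in every state
```

Random tie-breaking in `epsilon_greedy` matters here: with ϵ as small as 0.001, a deterministic tie-break on the all-zero initial table would leave the agent stuck, whereas random tie-breaking lets the untrained agent random-walk to the rewarding state.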