Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Your Policy Regularizer is Secretly an Adversary
Authors: Rob Brekelmans, Tim Genewein, Jordi Grau-Moya, Grégoire Delétang, Markus Kunesch, Shane Legg, Pedro A. Ortega
TMLR 2022 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We perform experiments for a sequential grid-world task in Sec. 4 where, in contrast to previous work, we explicitly visualize the reward robustness and adversarial strategies resulting from our theory. |
| Researcher Affiliation | Collaboration | Rob Brekelmans (University of Southern California, Information Sciences Institute); Tim Genewein, Jordi Grau-Moya, Grégoire Delétang, Markus Kunesch, Shane Legg, Pedro Ortega (DeepMind) |
| Pseudocode | No | The paper provides mathematical formulations and derivations, but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks with structured steps. |
| Open Source Code | No | The paper does not contain any explicit statements about releasing source code for the described methodology, nor does it provide a link to a code repository. |
| Open Datasets | No | In Fig. 4(a), we consider a grid world where the agent receives +5 for picking up the reward pill, -1 for stepping in water, and zero reward otherwise. (This describes a custom-defined environment, not a publicly available dataset with access information.) |
| Dataset Splits | No | The paper describes experiments in a 'grid world' environment, which is a simulated task. It does not mention any training, validation, or test dataset splits, as it's not applicable in the traditional sense for this type of simulated environment. |
| Hardware Specification | No | The paper does not provide any specific details about the hardware used to run the experiments, such as GPU/CPU models, memory, or processor types. |
| Software Dependencies | No | The paper mentions 'cvxpy (Diamond & Boyd, 2016)' as a tool used, but does not provide a specific version number for cvxpy or any other key software dependencies. |
| Experiment Setup | Yes | We train an agent using tabular Q-learning and a discount factor γ = 0.99. We consider the single-step example in Sec. 4.1 Fig. 2 or App. H Fig. 10-11, with a two-dimensional action space, optimal state-action value estimates Q(a, s) = r(a, s) = {1.1, 0.8}, and uniform prior π0(a\|s). The case of policy regularization with α = 2 and β = 10 is particularly interesting. |
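The single-step setting quoted above (two actions, Q-values {1.1, 0.8}, uniform prior, β = 10) can be illustrated with the standard closed-form result for KL-regularized policies, π*(a|s) ∝ π0(a|s) exp(β Q(a, s)). This is a hedged sketch of that textbook formula, not code from the paper; the paper's broader α-divergence analysis (e.g. α = 2) is not reproduced here, and the function name is our own:

```python
import numpy as np

def kl_regularized_policy(q_values, prior, beta):
    """Soft-optimal policy under KL regularization toward a prior:
    pi*(a) ∝ pi0(a) * exp(beta * Q(a)) (standard result; a sketch,
    not the paper's implementation)."""
    logits = beta * np.asarray(q_values, dtype=float) + np.log(np.asarray(prior, dtype=float))
    logits -= logits.max()  # subtract max for numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()

# Single-step example from the table: Q = {1.1, 0.8}, uniform prior, beta = 10
q = [1.1, 0.8]
prior = [0.5, 0.5]
pi = kl_regularized_policy(q, prior, beta=10.0)
```

With a uniform prior this reduces to a softmax over β·Q, so the higher-value action receives probability 1/(1 + exp(-β·ΔQ)) ≈ 0.95 here; as β → ∞ the policy hardens toward the greedy action, and as β → 0 it falls back to the prior.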