Twice regularized MDPs and the equivalence between robustness and regularization

Authors: Esther Derman, Matthieu Geist, Shie Mannor

NeurIPS 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | (Sec. 6, "Numerical Experiments") "We aim to compare the computing time of R2 MPI with that of MPI [30] and robust MPI [18]. The code is available at https://github.com/EstherDerman/r2mdp. To do so, we run experiments on an Intel(R) Core(TM) i7-1068NG7 CPU @ 2.30GHz machine, which we test on a 5×5 grid-world domain."
Researcher Affiliation | Collaboration | Esther Derman (Technion); Matthieu Geist (Google Research, Brain Team); Shie Mannor (Technion, NVIDIA Research)
Pseudocode | Yes | Algorithm 1 (R2 MPI). Result: π_{k+1}, v_{k+1}. Initialize v_0 ∈ ℝ^S; while not converged do: π_{k+1} ← G_{Ω_{R2}}(v_k); v_{k+1} ← (T_{R2}^{π_{k+1}})^m v_k; end
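The greedy-then-partial-evaluation loop of Algorithm 1 can be sketched in code. This is a hypothetical illustration, not the authors' implementation: the function name `r2_mpi`, the toy MDP, and the use of a flat reward penalty `alpha` in place of the full R2 regularizer Ω_{R2} (which in the paper also covers transition uncertainty) are all assumptions.

```python
import numpy as np

def r2_mpi(P, r, gamma=0.9, alpha=0.1, m=5, iters=200, tol=1e-8):
    """Sketch of R2 MPI on a tabular MDP.

    P: (S, A, S) transition tensor, r: (S, A) reward matrix.
    `alpha` is a stand-in for the R2 regularizer: here it simply
    penalizes every reward by the uncertainty-ball radius.
    """
    S, A, _ = P.shape
    v = np.zeros(S)
    pi = np.zeros(S, dtype=int)
    for _ in range(iters):
        # Regularized greedy step G_{Omega_R2}(v_k): act greedily
        # with respect to the penalized rewards r - alpha.
        q = (r - alpha) + gamma * (P @ v)          # shape (S, A)
        pi = q.argmax(axis=1)
        # m applications of the evaluation operator (T_{R2}^{pi})^m v_k.
        v_new = v
        for _ in range(m):
            v_new = (r - alpha)[np.arange(S), pi] + gamma * (P[np.arange(S), pi] @ v_new)
        if np.max(np.abs(v_new - v)) < tol:        # convergence check
            v = v_new
            break
        v = v_new
    return pi, v
```

With `alpha=0` this reduces to standard MPI; a positive `alpha` shifts every state value down by `alpha / (1 - gamma)` when the greedy policy is unchanged, which is the simplest visible effect of the regularizer.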
Open Source Code | Yes | "The code is available at https://github.com/EstherDerman/r2mdp."
Open Datasets | No | The experiments use a 5×5 grid-world domain, a custom simulation environment rather than a publicly available dataset with access details such as a URL, DOI, or formal citation.
Dataset Splits | No | The paper evaluates algorithms in a custom 5×5 grid-world environment; it does not mention train/validation/test splits or give percentages or sample counts for any such splits.
Hardware Specification | Yes | "To do so, we run experiments on an Intel(R) Core(TM) i7-1068NG7 CPU @ 2.30GHz machine."
Software Dependencies | No | The paper links the code ("The code is available at https://github.com/EstherDerman/r2mdp") but does not list any software dependencies with version numbers (e.g., Python 3.x, PyTorch 1.x).
Experiment Setup | Yes | "Parameter values and other implementation details are deferred to Appx. D. We obtain the same value for R2 PE and robust PE, which numerically confirms Thm. 4.1. For simplicity, we focus on an (s, a)-rectangular uncertainty set and take the same ball radius α (resp. β) at each state-action pair for the reward function (resp. transition function)."
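The agreement between robust PE and R2 PE quoted above can be checked numerically in its simplest form. The sketch below is a hypothetical toy, not the paper's Appx. D setup: for an (s, a)-rectangular reward-uncertainty ball of radius `alpha` (and no transition uncertainty, i.e. β = 0), robust policy evaluation with an explicit inner minimization coincides with regularized policy evaluation that penalizes rewards by `alpha`. The function names and the random test MDP are assumptions.

```python
import numpy as np

def regularized_pe(P_pi, r_pi, alpha, gamma=0.9):
    """Regularized PE: penalize rewards by the ball radius alpha,
    then solve (I - gamma * P_pi) v = r_pi - alpha exactly."""
    S = len(r_pi)
    return np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi - alpha)

def robust_pe(P_pi, r_pi, alpha, gamma=0.9, sweeps=500):
    """Robust PE: fixed-point iteration with an explicit inner
    minimization over reward perturbations u, |u(s)| <= alpha."""
    v = np.zeros(len(r_pi))
    us = np.linspace(-alpha, alpha, 5)   # grid over the uncertainty ball
    for _ in range(sweeps):
        worst_r = np.min(r_pi[:, None] + us[None, :], axis=1)
        v = worst_r + gamma * (P_pi @ v)
    return v
```

For reward-only uncertainty the inner minimum is attained at `r - alpha`, so the two routines return the same value function; the theorem in the paper establishes the equivalence in far greater generality, including transition uncertainty of radius β.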