Reinforcement Learning with Convex Constraints

Authors: Sobhan Miryoosefi, Kianté Brantley, Hal Daumé III, Miroslav Dudík, Robert E. Schapire

NeurIPS 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In our experiments, we show that it matches previous algorithms that enforce safety via constraints, but can also enforce new properties that these algorithms cannot incorporate, such as diversity.
Researcher Affiliation | Collaboration | Sobhan Miryoosefi (Princeton University, miryoosefi@cs.princeton.edu); Kianté Brantley (University of Maryland, kdbrant@cs.umd.edu); Hal Daumé III (Microsoft Research and University of Maryland, me@hal3.name); Miroslav Dudík (Microsoft Research, mdudik@microsoft.com); Robert E. Schapire (Microsoft Research, schapire@microsoft.com)
Pseudocode | Yes | Algorithm 1: Solving a game with repeated play; Algorithm 2: APPROPO. (A minimal repeated-play sketch follows this table.)
Open Source Code | No | The paper does not provide any links or explicit statements about the availability of open-source code for the described methodology.
Open Datasets | No | The paper mentions using a 'Mars rover grid-world environment' but does not provide access information (link, DOI, etc.) for a publicly available or open dataset.
Dataset Splits | No | The paper does not specify explicit train/validation/test dataset splits. It discusses a reinforcement learning environment where data is generated dynamically rather than drawn from fixed, pre-split datasets.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory) used for running its experiments.
Software Dependencies | No | The paper mentions using 'A2C' and 'online gradient descent with momentum' but does not provide specific version numbers for these or any other software components.
Experiment Setup | Yes | For a fair comparison, APPROPO uses A2C as a positive-response oracle, with the same hyperparameters as used in RCPO. Online learning in the outer loop of APPROPO was implemented via online gradient descent with momentum. Both RCPO and APPROPO have an outer-loop learning rate parameter, which we tuned over a grid of values 10^i with integer i (see Appendix F for the details)... The agent receives a small negative reward each time step and zero for terminating, with γ = 0.99... We used the same safety constraint as Tessler et al. (2019): ensure that the (discounted) probability of hitting a rock is at most a fixed threshold (set to 0.2)... an additional constraint requiring that the reward be at least 0.17... requiring that the Euclidean distance between our visitation probability vector (across the cells of the grid) and the uniform distribution over the upper-right triangle cells of the grid (excluding rocks) be at most 0.12. (A constraint-checking sketch follows this table.)
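The Pseudocode row refers to Algorithm 1 (solving a game with repeated play) and Algorithm 2 (APPROPO), in which an online learner plays a direction vector and an RL oracle (A2C in the experiments) responds to the resulting scalarized reward, so that the running average of measurement vectors is driven toward a convex target set. Below is a minimal sketch of that repeated-play structure on a toy problem, not the authors' implementation: the target set is a box, a fixed list of candidate measurement vectors stands in for the RL oracle, and the outer player runs online gradient descent with momentum as in the paper's experiments. All names (`repeated_play`, `support_point`, `project_l2_ball`) and numbers are illustrative assumptions; the paper's exact losses, feasible sets, and oracle definitions differ.

```python
import numpy as np

def project_l2_ball(x, radius=1.0):
    """Project x onto the L2 ball of the given radius."""
    norm = np.linalg.norm(x)
    return x if norm <= radius else x * (radius / norm)

def support_point(lam, lo, hi):
    """Maximizer of lam . c over the box [lo, hi], chosen elementwise."""
    return np.where(lam >= 0, hi, lo)

def repeated_play(candidates, lo, hi, rounds=2000, lr=0.05, momentum=0.9):
    """Repeated play between an online direction player and a best-response player.

    candidates: (k, d) array; each row stands in for the expected measurement
        vector of one policy the RL oracle could return.
    Returns the running average of the chosen measurement vectors, which this
    scheme pushes toward the box [lo, hi] whenever a feasible mixture exists.
    """
    d = candidates.shape[1]
    lam = np.zeros(d)   # direction played by the online learner (kept in the unit ball)
    vel = np.zeros(d)   # momentum buffer for the outer-loop updates
    avg = np.zeros(d)   # running average of measurement vectors
    for t in range(1, rounds + 1):
        # Best-response player: pick the candidate minimizing lam . z,
        # standing in for an RL oracle trained on the scalarized reward.
        z = candidates[np.argmin(candidates @ lam)]
        avg += (z - avg) / t
        # Online player: gradient ascent (with momentum) on lam . z - h_C(lam);
        # for a box, the gradient is z minus the support point in direction lam.
        vel = momentum * vel + (z - support_point(lam, lo, hi))
        lam = project_l2_ball(lam + lr * vel)
    return avg

# Toy usage: one high-reward but unsafe candidate and one safe but low-reward
# candidate; only a mixture of the two meets both thresholds (numbers illustrative).
cands = np.array([[0.30, 0.35],   # (reward, rock-hit probability)
                  [0.10, 0.05]])
lo = np.array([0.17, 0.00])       # reward at least 0.17
hi = np.array([1.00, 0.20])       # rock-hit probability at most 0.2
print(repeated_play(cands, lo, hi))
```

The printed average should drift toward the box as the two candidates are mixed over rounds; in APPROPO the candidate lookup is replaced by running the RL oracle and estimating its measurement vector from rollouts.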
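The Experiment Setup row lists three constraints on discounted quantities: the discounted probability of hitting a rock is at most 0.2, the discounted reward is at least 0.17, and the Euclidean distance between the visitation probability vector and the uniform distribution over the upper-right-triangle cells is at most 0.12. The sketch below shows one way such quantities could be estimated from a single rollout and compared to those thresholds; the trajectory format, grid size, cell indexing, normalization of the visitation vector, and all helper names are assumptions rather than the paper's definitions, and the paper's constraints apply to expectations under the policy, not a single rollout.

```python
import numpy as np

GAMMA = 0.99      # discount factor from the setup
N_CELLS = 64      # e.g., an 8x8 grid; the actual grid size is an assumption

def discounted_measurements(trajectory):
    """Estimate (discounted reward, discounted rock-hit mass, visitation vector)
    from one rollout given as (cell_index, reward, hit_rock) tuples."""
    reward, rock = 0.0, 0.0
    visits = np.zeros(N_CELLS)
    discount = 1.0
    for cell, r, hit_rock in trajectory:
        reward += discount * r
        rock += discount * float(hit_rock)
        visits[cell] += discount
        discount *= GAMMA
    visits /= max(visits.sum(), 1e-12)   # normalize to a probability vector (one simple choice)
    return reward, rock, visits

def satisfies_constraints(reward, rock, visits, triangle_cells):
    """Check the three thresholds quoted in the setup (0.17, 0.2, 0.12)."""
    target = np.zeros(N_CELLS)
    target[triangle_cells] = 1.0 / len(triangle_cells)   # uniform over the triangle cells
    diversity_ok = np.linalg.norm(visits - target) <= 0.12
    return (reward >= 0.17) and (rock <= 0.2) and bool(diversity_ok)
```

A policy-level check would average such rollout estimates over many episodes before comparing them to the thresholds, which is how the measurement vector enters the repeated-play loop sketched above.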