Reinforcement Learning with Convex Constraints

Authors: Sobhan Miryoosefi, Kianté Brantley, Hal Daumé III, Miroslav Dudík, Robert E. Schapire

NeurIPS 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In our experiments, we show that it matches previous algorithms that enforce safety via constraints, but can also enforce new properties that these algorithms cannot incorporate, such as diversity.
Researcher Affiliation | Collaboration | Sobhan Miryoosefi (Princeton University, miryoosefi@cs.princeton.edu); Kianté Brantley (University of Maryland, kdbrant@cs.umd.edu); Hal Daumé III (Microsoft Research and University of Maryland, me@hal3.name); Miroslav Dudík (Microsoft Research, mdudik@microsoft.com); Robert E. Schapire (Microsoft Research, schapire@microsoft.com)
Pseudocode | Yes | Algorithm 1: Solving a game with repeated play; Algorithm 2: APPROPO. (A minimal repeated-play sketch follows this table.)
Open Source Code | No | The paper does not provide any links or explicit statements about the availability of open-source code for the described methodology.
Open Datasets | No | The paper mentions using a 'Mars rover grid-world environment' but does not provide access information (link, DOI, etc.) for a publicly available or open dataset.
Dataset Splits | No | The paper does not specify explicit train/validation/test dataset splits. It discusses a reinforcement learning environment where data is generated dynamically rather than drawn from fixed, pre-split datasets.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory) used for running its experiments.
Software Dependencies | No | The paper mentions using 'A2C' and 'online gradient descent with momentum' but does not provide specific version numbers for these or any other software components.
Experiment Setup | Yes | For a fair comparison, APPROPO uses A2C as a positive-response oracle, with the same hyperparameters as used in RCPO. Online learning in the outer loop of APPROPO was implemented via online gradient descent with momentum. Both RCPO and APPROPO have an outer-loop learning rate parameter, which we tuned over a grid of values 10^i with integer i (see Appendix F for the details)... The agent receives a small negative reward each time step and zero for terminating, with γ = 0.99... We used the same safety constraint as Tessler et al. (2019): ensure that the (discounted) probability of hitting a rock is at most a fixed threshold (set to 0.2)... an additional constraint requiring that the reward be at least 0.17... requiring that the Euclidean distance between our visitation probability vector (across the cells of the grid) and the uniform distribution over the upper-right triangle cells of the grid (excluding rocks) be at most 0.12. (A constraint-checking sketch follows this table.)
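The Pseudocode row refers to Algorithm 1 (solving a game with repeated play) and Algorithm 2 (APPROPO), in which an online learner plays a direction vector and an RL oracle (A2C in the experiments) responds to the resulting scalarized reward, so that the running average of measurement vectors is driven toward a convex target set. Below is a minimal sketch of that repeated-play structure on a toy problem, not the authors' implementation: the target set is a box, a fixed list of candidate measurement vectors stands in for the RL oracle, and the outer player runs online gradient descent with momentum as in the paper's experiments. All names (`repeated_play`, `support_point`, `project_l2_ball`) and numbers are illustrative assumptions; the paper's exact losses, feasible sets, and oracle definitions differ.

```python
import numpy as np

def project_l2_ball(x, radius=1.0):
    """Project x onto the L2 ball of the given radius."""
    norm = np.linalg.norm(x)
    return x if norm <= radius else x * (radius / norm)

def support_point(lam, lo, hi):
    """Maximizer of lam . c over the box [lo, hi], chosen elementwise."""
    return np.where(lam >= 0, hi, lo)

def repeated_play(candidates, lo, hi, rounds=2000, lr=0.05, momentum=0.9):
    """Repeated play between an online direction player and a best-response player.

    candidates: (k, d) array; each row stands in for the expected measurement
        vector of one policy the RL oracle could return.
    Returns the running average of the chosen measurement vectors, which this
    scheme pushes toward the box [lo, hi] whenever a feasible mixture exists.
    """
    d = candidates.shape[1]
    lam = np.zeros(d)   # direction played by the online learner (kept in the unit ball)
    vel = np.zeros(d)   # momentum buffer for the outer-loop updates
    avg = np.zeros(d)   # running average of measurement vectors
    for t in range(1, rounds + 1):
        # Best-response player: pick the candidate minimizing lam . z,
        # standing in for an RL oracle trained on the scalarized reward.
        z = candidates[np.argmin(candidates @ lam)]
        avg += (z - avg) / t
        # Online player: gradient ascent (with momentum) on lam . z - h_C(lam);
        # for a box, the gradient is z minus the support point in direction lam.
        vel = momentum * vel + (z - support_point(lam, lo, hi))
        lam = project_l2_ball(lam + lr * vel)
    return avg

# Toy usage: one high-reward but unsafe candidate and one safe but low-reward
# candidate; only a mixture of the two meets both thresholds (numbers illustrative).
cands = np.array([[0.30, 0.35],   # (reward, rock-hit probability)
                  [0.10, 0.05]])
lo = np.array([0.17, 0.00])       # reward at least 0.17
hi = np.array([1.00, 0.20])       # rock-hit probability at most 0.2
print(repeated_play(cands, lo, hi))
```

The printed average should drift toward the box as the two candidates are mixed over rounds; in APPROPO the candidate lookup is replaced by running the RL oracle and estimating its measurement vector from rollouts.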
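The Experiment Setup row lists three constraints on discounted quantities: the discounted probability of hitting a rock is at most 0.2, the discounted reward is at least 0.17, and the Euclidean distance between the visitation probability vector and the uniform distribution over the upper-right-triangle cells is at most 0.12. The sketch below shows one way such quantities could be estimated from a single rollout and compared to those thresholds; the trajectory format, grid size, cell indexing, normalization of the visitation vector, and all helper names are assumptions rather than the paper's definitions, and the paper's constraints apply to expectations under the policy, not a single rollout.

```python
import numpy as np

GAMMA = 0.99      # discount factor from the setup
N_CELLS = 64      # e.g., an 8x8 grid; the actual grid size is an assumption

def discounted_measurements(trajectory):
    """Estimate (discounted reward, discounted rock-hit mass, visitation vector)
    from one rollout given as (cell_index, reward, hit_rock) tuples."""
    reward, rock = 0.0, 0.0
    visits = np.zeros(N_CELLS)
    discount = 1.0
    for cell, r, hit_rock in trajectory:
        reward += discount * r
        rock += discount * float(hit_rock)
        visits[cell] += discount
        discount *= GAMMA
    visits /= max(visits.sum(), 1e-12)   # normalize to a probability vector (one simple choice)
    return reward, rock, visits

def satisfies_constraints(reward, rock, visits, triangle_cells):
    """Check the three thresholds quoted in the setup (0.17, 0.2, 0.12)."""
    target = np.zeros(N_CELLS)
    target[triangle_cells] = 1.0 / len(triangle_cells)   # uniform over the triangle cells
    diversity_ok = np.linalg.norm(visits - target) <= 0.12
    return (reward >= 0.17) and (rock <= 0.2) and bool(diversity_ok)
```

A policy-level check would average such rollout estimates over many episodes before comparing them to the thresholds, which is how the measurement vector enters the repeated-play loop sketched above.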