Reinforcement Learning with Convex Constraints
Authors: Sobhan Miryoosefi, Kianté Brantley, Hal Daumé III, Miroslav Dudík, Robert E. Schapire
NeurIPS 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In our experiments, we show that it matches previous algorithms that enforce safety via constraints, but can also enforce new properties that these algorithms cannot incorporate, such as diversity. |
| Researcher Affiliation | Collaboration | Sobhan Miryoosefi (Princeton University, miryoosefi@cs.princeton.edu); Kianté Brantley (University of Maryland, kdbrant@cs.umd.edu); Hal Daumé III (Microsoft Research / University of Maryland, me@hal3.name); Miroslav Dudík (Microsoft Research, mdudik@microsoft.com); Robert E. Schapire (Microsoft Research, schapire@microsoft.com) |
| Pseudocode | Yes | Algorithm 1: Solving a game with repeated play; Algorithm 2: APPROPO |
| Open Source Code | No | The paper does not provide any links or explicit statements about the availability of open-source code for the described methodology. |
| Open Datasets | No | The paper mentions using a 'Mars rover grid-world environment' but does not provide access information (link, DOI, etc.) for a publicly available or open dataset. |
| Dataset Splits | No | The paper does not specify explicit train/validation/test dataset splits. It discusses a reinforcement learning environment where data is generated dynamically, rather than using fixed pre-split datasets. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory) used for running its experiments. |
| Software Dependencies | No | The paper mentions using 'A2C' and 'online gradient descent with momentum' but does not provide specific version numbers for these or any other software components. |
| Experiment Setup | Yes | For a fair comparison, APPROPO uses A2C as a positive-response oracle, with the same hyperparameters as used in RCPO. Online learning in the outer loop of APPROPO was implemented via online gradient descent with momentum. Both RCPO and APPROPO have an outer-loop learning rate parameter, which we tuned over a grid of values 10^i with integer i (see Appendix F for the details)... The agent receives small negative reward each time step and zero for terminating, with γ = 0.99... We used the same safety constraint as Tessler et al. (2019): ensure that the (discounted) probability of hitting a rock is at most a fixed threshold (set to 0.2)... an additional constraint requiring that the reward be at least 0.17... requiring that the Euclidean distance between our visitation probability vector (across the cells of the grid) and the uniform distribution over the upper-right triangle cells of the grid (excluding rocks) be at most 0.12. |
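
The Pseudocode row above names Algorithm 1 (solving a game with repeated play) and Algorithm 2 (APPROPO). As a rough, hedged illustration of that structure only, the sketch below shows a repeated-play outer loop in the spirit described in the table: an online-gradient-descent λ-player, a positive-response RL oracle (A2C in the paper's experiments) best-responding to the scalarized reward induced by λ, and a mixed policy formed by averaging the iterates. The callables `best_response_oracle`, `estimate_measurements`, and `lambda_loss_grad`, along with the unit-ball projection, are assumptions of this sketch, not the paper's implementation.

```python
import numpy as np

def project_to_unit_ball(lam):
    """Keep the lambda-player's iterate inside the unit L2 ball (an assumed feasible set)."""
    norm = np.linalg.norm(lam)
    return lam if norm <= 1.0 else lam / norm

def repeated_play_outer_loop(best_response_oracle, estimate_measurements, lambda_loss_grad,
                             d, rounds=100, lr=0.1, momentum=0.9):
    """Hedged sketch of a repeated-play outer loop in the spirit of Algorithms 1-2.

    Assumed (hypothetical) callables:
      best_response_oracle(lam)  -- trains a policy (e.g. via A2C) against the scalarized
                                    reward induced by lam and returns it.
      estimate_measurements(pi)  -- returns the d-dimensional long-term measurement
                                    vector z(pi) estimated from rollouts.
      lambda_loss_grad(lam, z)   -- gradient of the lambda-player's per-round loss;
                                    its exact form follows the paper, not this sketch.
    """
    lam = np.zeros(d)                # lambda-player's iterate
    velocity = np.zeros(d)           # momentum buffer for online gradient descent
    policies, measurements = [], []

    for _ in range(rounds):
        policy = best_response_oracle(lam)    # policy player's (approximate) best response
        z = estimate_measurements(policy)     # measurement vector of that policy
        policies.append(policy)
        measurements.append(z)

        # Online gradient descent with momentum for the lambda-player.
        grad = lambda_loss_grad(lam, z)
        velocity = momentum * velocity + grad
        lam = project_to_unit_ball(lam - lr * velocity)

    # The returned mixed policy plays a uniformly random element of `policies`;
    # its measurement vector is approximately the average of `measurements`.
    return policies, np.mean(measurements, axis=0)
```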
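
The Experiment Setup row quotes three target constraints for the Mars rover grid world: discounted probability of hitting a rock at most 0.2, reward at least 0.17, and Euclidean distance of at most 0.12 between the visitation probability vector and the uniform distribution over the upper-right triangle of non-rock cells. The snippet below is only a hypothetical illustration of how such quantities could be estimated from a rollout; the trajectory format, helper names, and grid encoding are assumptions of this sketch, not the paper's code.

```python
import numpy as np

GAMMA = 0.99  # discount factor quoted in the setup row

def discounted_measurements(trajectory, grid_shape, rock_cells, upper_right_cells):
    """Hypothetical per-trajectory estimates of the three constrained quantities.

    trajectory: list of (cell, reward, hit_rock) tuples, one per time step,
    where `cell` is a flat grid index (assumed format).
    """
    n_cells = grid_shape[0] * grid_shape[1]
    visitation = np.zeros(n_cells)   # discounted visitation over grid cells
    disc_reward = 0.0                # discounted reward
    disc_rock_prob = 0.0             # discounted indicator of hitting a rock

    for t, (cell, reward, hit_rock) in enumerate(trajectory):
        w = GAMMA ** t
        visitation[cell] += w
        disc_reward += w * reward
        disc_rock_prob += w * float(hit_rock)

    visitation /= max(visitation.sum(), 1e-12)   # normalize to a probability vector

    # Uniform target over the upper-right triangle cells, excluding rocks.
    target = np.zeros(n_cells)
    diverse_cells = [c for c in upper_right_cells if c not in rock_cells]
    target[diverse_cells] = 1.0 / max(len(diverse_cells), 1)

    return {
        "rock_prob": disc_rock_prob,                                    # want <= 0.2
        "reward": disc_reward,                                          # want >= 0.17
        "diversity_dist": float(np.linalg.norm(visitation - target)),  # want <= 0.12
    }
```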