Safe Reinforcement Learning with Natural Language Constraints
Authors: Tsung-Yen Yang, Michael Y Hu, Yinlam Chow, Peter J Ramadge, Karthik Narasimhan
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Across different domains in HAZARDWORLD, we show that our method achieves higher rewards (up to 11x) and fewer constraint violations (by 1.8x) compared to existing approaches. However, in terms of absolute performance, HAZARDWORLD still poses significant challenges for agents to learn efficiently, motivating the need for future work. |
| Researcher Affiliation | Collaboration | Tsung-Yen Yang (Princeton University, ty3@princeton.edu); Michael Hu (Princeton University, michael.hu@yobi.ventures); Yinlam Chow (Google Research, yinlamchow@google.com); Peter J. Ramadge (Princeton University, ramadge@princeton.edu); Karthik Narasimhan (Princeton University, karthikn@princeton.edu) |
| Pseudocode | No | The paper describes the model architecture and training procedures using text and figures (Figures 2 and 3), but it does not include any explicit pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code and data are available at https://github.com/princeton-nlp/SRL-NLC |
| Open Datasets | Yes | To our knowledge, there do not currently exist datasets for evaluating RL agents that obey textual constraints. Thus, we design a new benchmark called HAZARDWORLD... Code and data are available at https://github.com/princeton-nlp/SRL-NLC |
| Dataset Splits | Yes | We generate two disjoint training and evaluation datasets D_train and D_eval. D_train consists of 10,000 randomly generated maps paired with 80% of the textual constraints (787 constraints overall), i.e., on average each constraint is paired with 12.70 different maps. D_eval consists of 5,000 randomly generated maps paired with the remaining 20% of the textual constraints (197 constraints), i.e., on average one constraint is paired with 25.38 maps. |
| Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., CPU, GPU models, or cloud computing instance types) used for running its experiments. |
| Software Dependencies | No | The paper mentions building on 'existing RL software frameworks [17, 18]' like 'BabyAI [17, 56]' and 'SAFETY GYM environment [18]', but it does not specify any version numbers for these frameworks or other software dependencies. |
| Experiment Setup | No | The paper states, 'More details on the implementation, hyper-parameters, and computational resources are included in the Appendix A, B, and C.' This indicates that specific experimental setup details are not provided in the main text. |
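
The dataset split quoted in the table can be illustrated with a minimal sketch. This is not the authors' code: the function names, the placeholder constraint strings, the seeds, and the uniform random pairing of maps with constraints are all assumptions made for illustration. The total of 984 constraints is inferred from the quoted counts (787 + 197), not stated directly in the quoted text.

```python
import random


def split_constraints(constraints, train_fraction=0.8, seed=0):
    """Shuffle constraints and split them into disjoint train/eval lists.

    Hypothetical helper; the paper's actual split procedure may differ.
    """
    shuffled = list(constraints)
    random.Random(seed).shuffle(shuffled)
    cut = round(train_fraction * len(shuffled))
    return shuffled[:cut], shuffled[cut:]


def pair_maps(num_maps, constraints, seed=0):
    """Pair each map id with one constraint drawn uniformly at random.

    Uniform sampling is an assumption; it only reproduces the reported
    average number of maps per constraint, not the exact pairing.
    """
    rng = random.Random(seed)
    return [(map_id, rng.choice(constraints)) for map_id in range(num_maps)]


# Placeholder strings standing in for the textual constraints.
all_constraints = [f"constraint_{i}" for i in range(984)]

train_constraints, eval_constraints = split_constraints(all_constraints)
d_train = pair_maps(10_000, train_constraints, seed=1)
d_eval = pair_maps(5_000, eval_constraints, seed=2)

# Average number of maps per constraint in each split.
maps_per_train_constraint = len(d_train) / len(train_constraints)
maps_per_eval_constraint = len(d_eval) / len(eval_constraints)

print(len(train_constraints), len(eval_constraints))
print(maps_per_train_constraint, maps_per_eval_constraint)
```

With these sizes, the two constraint lists hold 787 and 197 items and share no element, so an agent evaluated on the second split only sees constraint texts it was never trained on. The two averages come out close to the 12.70 and 25.38 maps per constraint reported in the table.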