Safe Reinforcement Learning with Natural Language Constraints

Authors: Tsung-Yen Yang, Michael Y Hu, Yinlam Chow, Peter J Ramadge, Karthik Narasimhan

NeurIPS 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Across different domains in HAZARDWORLD, we show that our method achieves higher rewards (up to 11x) and fewer constraint violations (by 1.8x) compared to existing approaches. However, in terms of absolute performance, HAZARDWORLD still poses significant challenges for agents to learn efficiently, motivating the need for future work.
Researcher Affiliation | Collaboration | Tsung-Yen Yang (Princeton University, ty3@princeton.edu); Michael Hu (Princeton University, michael.hu@yobi.ventures); Yinlam Chow (Google Research, yinlamchow@google.com); Peter J. Ramadge (Princeton University, ramadge@princeton.edu); Karthik Narasimhan (Princeton University, karthikn@princeton.edu)
Pseudocode | No | The paper describes the model architecture and training procedures using text and figures (Figures 2 and 3), but it does not include any explicit pseudocode or algorithm blocks.
Open Source Code | Yes | Code and data are available at https://github.com/princeton-nlp/SRL-NLC
Open Datasets | Yes | To our knowledge, there do not currently exist datasets for evaluating RL agents that obey textual constraints. Thus, we design a new benchmark called HAZARDWORLD... Code and data are available at https://github.com/princeton-nlp/SRL-NLC
Dataset Splits | Yes | We generate two disjoint training and evaluation datasets Dtrain and Deval. Dtrain consists of 10,000 randomly generated maps paired with 80% of the textual constraints (787 constraints overall), i.e., on average each constraint is paired with 12.70 different maps. Deval consists of 5,000 randomly generated maps paired with the remaining 20% of the textual constraints (197 constraints), i.e., on average one constraint is paired with 25.38 maps.
Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., CPU, GPU models, or cloud computing instance types) used for running its experiments.
Software Dependencies | No | The paper mentions building on 'existing RL software frameworks [17, 18]' like 'Baby AI [17, 56]' and 'SAFETY GYM environment [18]', but it does not specify version numbers for these frameworks or for any other software dependencies.
Experiment Setup | No | The paper states, 'More details on the implementation, hyper-parameters, and computational resources are included in the Appendix A,B, and C.' This indicates that specific experimental setup details are not provided in the main text.
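The per-constraint averages quoted in the Dataset Splits row follow directly from the split sizes. A minimal sanity check of that arithmetic (this snippet is illustrative only and is not code from the paper or its repository):

```python
# Reported HAZARDWORLD split sizes: 10,000 train maps over 787 train
# constraints, 5,000 eval maps over 197 eval constraints (787 + 197 = 984
# constraints total, split roughly 80/20).
train_maps, eval_maps = 10_000, 5_000
train_constraints, eval_constraints = 787, 197

# Average number of maps paired with each constraint.
avg_train = train_maps / train_constraints  # ~12.7, reported as 12.70
avg_eval = eval_maps / eval_constraints     # ~25.38, matching the report

print(f"train: {avg_train:.2f} maps/constraint")
print(f"eval:  {avg_eval:.2f} maps/constraint")
```

The tiny discrepancy on the train side (10,000 / 787 ≈ 12.706, which rounds to 12.71 rather than the reported 12.70) suggests the paper truncated rather than rounded; the eval-side figure matches exactly.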