Counterexample Guided RL Policy Refinement Using Bayesian Optimization

Authors: Briti Gangopadhyay, Pallab Dasgupta

NeurIPS 2021

| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We study the proposed methodology on environments that work over continuous space and continuous or discrete actions. The approach has been tested on several RL environments, and we demonstrate that the policy can be made to respect the safety specifications through such targeted changes. Section 5 presents case studies on several RL environments. |
| Researcher Affiliation | Academia | Briti Gangopadhyay, Department of Computer Science, Indian Institute of Technology Kharagpur, briti_gangopadhyay@iitkgp.ac.in; Pallab Dasgupta, Department of Computer Science, Indian Institute of Technology Kharagpur, pallab@cse.iitkgp.ac.in |
| Pseudocode | Yes | Algorithm 1: Finding Failure Trajectories; Algorithm 2: Policy Refinement Algorithm. (An illustrative sketch of the failure-trajectory search appears after the table.) |
| Open Source Code | Yes | Code is available online at https://github.com/britig/policy-refinement-bo |
| Open Datasets | Yes | We test our framework on a set of environments from OpenAI Gym [5]. [5] Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016. |
| Dataset Splits | No | The paper describes the environments and their use in experiments but does not provide specific details on how datasets were split for training, validation, or testing (e.g., percentages, sample counts, or explicit standard splits). |
| Hardware Specification | Yes | The experiments were run on a machine with an AMD Ryzen 4600H 6-core processor and a GeForce GTX 1660 graphics unit. |
| Software Dependencies | No | The paper mentions using specific frameworks like OpenAI Gym and PPO but does not provide version numbers for these software dependencies, which would be necessary for reproducibility. |
| Experiment Setup | Yes | We set max_budget = 200 for each iteration of BO. We set the advantage factor A_t to be 1. We add the evaluation of the objective functions ϕ_val_i with an importance of β along with the reward of a trajectory ξ_f_i while training π_c. (A sketch of this shaped objective also follows the table.) |
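To make the Pseudocode and Experiment Setup rows concrete, the following is a minimal sketch of what a Bayesian-optimization counterexample search in the spirit of Algorithm 1 (Finding Failure Trajectories) could look like. It assumes a scikit-optimize backend, the classic pre-0.26 Gym step API, a stand-in controller instead of the trained PPO policy, and a toy Pendulum safety property; the environment, search ranges, and property are illustrative assumptions rather than the paper's exact setup. Only max_budget = 200 is taken from the paper.

```python
import numpy as np
import gym                     # classic OpenAI Gym (pre-0.26 step API assumed)
from skopt import gp_minimize  # Bayesian-optimization backend (an assumption)
from skopt.space import Real

MAX_BUDGET = 200               # BO budget per iteration, as stated in the paper
env = gym.make("Pendulum-v1")

def policy(obs):
    # Stand-in for the trained policy under test: a crude proportional controller.
    cos_th, sin_th, th_dot = obs
    return np.clip([-(2.0 * sin_th + 0.5 * th_dot)], -2.0, 2.0)

def robustness(init_theta, init_theta_dot, horizon=100, theta_max=np.pi / 2):
    """Roll out from a chosen initial state and return the minimum margin of a
    toy safety property |theta| < theta_max (negative => failure trajectory)."""
    env.reset()
    env.unwrapped.state = np.array([init_theta, init_theta_dot])
    obs = np.array([np.cos(init_theta), np.sin(init_theta), init_theta_dot])
    margin = float("inf")
    for _ in range(horizon):
        obs, _, done, _ = env.step(policy(obs))
        theta = np.arctan2(obs[1], obs[0])
        margin = min(margin, theta_max - abs(theta))
        if done:
            break
    return float(margin)

# BO minimizes the robustness; any evaluation below zero is a counterexample,
# and the corresponding rollout can be stored as a failure trajectory.
result = gp_minimize(
    lambda x: robustness(*x),
    dimensions=[Real(-0.5, 0.5), Real(-1.0, 1.0)],  # initial-state ranges (assumed)
    n_calls=MAX_BUDGET,
    random_state=0,
)
print("counterexample found:", result.fun < 0, "| worst margin:", result.fun)
```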
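The quoted experiment setup also states that the evaluation of the property ϕ_val_i is added with importance β to the reward of a failure trajectory ξ_f_i while training the corrective policy π_c, with the advantage factor A_t set to 1. A minimal sketch of such a shaped objective, with an illustrative β value and a standard PPO clipped term, is shown below; the concrete numbers and function names are assumptions, not the paper's implementation.

```python
import numpy as np

BETA = 0.5       # importance of the property evaluation (value is an assumption)
ADVANTAGE = 1.0  # advantage factor A_t fixed to 1, as stated in the setup

def shaped_return(rewards, phi_vals, beta=BETA):
    """Return of a failure trajectory xi_f augmented with the beta-weighted
    evaluation of the safety property along the same trajectory."""
    return float(np.sum(rewards) + beta * np.sum(phi_vals))

def clipped_ppo_term(ratio, eps=0.2, advantage=ADVANTAGE):
    """Standard PPO clipped surrogate term, here with the constant advantage A_t = 1."""
    return min(ratio * advantage,
               float(np.clip(ratio, 1.0 - eps, 1.0 + eps)) * advantage)

# Example: per-step task rewards and property margins from a short failure trajectory.
print(shaped_return(rewards=[-1.0, -0.5, -2.0], phi_vals=[0.2, -0.1, -0.4]))
print(clipped_ppo_term(ratio=1.3))
```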