Counterexample Guided RL Policy Refinement Using Bayesian Optimization

Authors: Briti Gangopadhyay, Pallab Dasgupta

NeurIPS 2021

| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We study the proposed methodology on environments that work over continuous space and continuous or discrete actions. The approach has been tested on several RL environments, and we demonstrate that the policy can be made to respect the safety specifications through such targeted changes. Section 5 presents case studies on several RL environments. |
| Researcher Affiliation | Academia | Briti Gangopadhyay, Department of Computer Science, Indian Institute of Technology Kharagpur, briti_gangopadhyay@iitkgp.ac.in; Pallab Dasgupta, Department of Computer Science, Indian Institute of Technology Kharagpur, pallab@cse.iitkgp.ac.in |
| Pseudocode | Yes | Algorithm 1: Finding Failure Trajectories; Algorithm 2: Policy Refinement Algorithm. (An illustrative sketch of the failure-trajectory search appears after the table.) |
| Open Source Code | Yes | Code is available online at https://github.com/britig/policy-refinement-bo |
| Open Datasets | Yes | We test our framework on a set of environments from OpenAI Gym [5]. [5] Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016. |
| Dataset Splits | No | The paper describes the environments and their use in experiments but does not provide specific details on how datasets were split for training, validation, or testing (e.g., percentages, sample counts, or explicit standard splits). |
| Hardware Specification | Yes | The experiments were run on a machine with an AMD Ryzen 4600H 6-core processor and a GeForce GTX 1660 graphics unit. |
| Software Dependencies | No | The paper mentions using specific frameworks like OpenAI Gym and PPO but does not provide version numbers for these software dependencies, which would be necessary for reproducibility. |
| Experiment Setup | Yes | We set max_budget = 200 for each iteration of BO. We set the advantage factor A_t to be 1. We add the evaluation of the objective functions ϕ_val_i with an importance of β along with the reward of a trajectory ξ_f_i while training π_c. (A sketch of this shaped objective also follows the table.) |
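To make the Pseudocode and Experiment Setup rows concrete, the following is a minimal sketch of what a Bayesian-optimization counterexample search in the spirit of Algorithm 1 (Finding Failure Trajectories) could look like. It assumes a scikit-optimize backend, the classic pre-0.26 Gym step API, a stand-in controller instead of the trained PPO policy, and a toy Pendulum safety property; the environment, search ranges, and property are illustrative assumptions rather than the paper's exact setup. Only max_budget = 200 is taken from the paper.

```python
import numpy as np
import gym                     # classic OpenAI Gym (pre-0.26 step API assumed)
from skopt import gp_minimize  # Bayesian-optimization backend (an assumption)
from skopt.space import Real

MAX_BUDGET = 200               # BO budget per iteration, as stated in the paper
env = gym.make("Pendulum-v1")

def policy(obs):
    # Stand-in for the trained policy under test: a crude proportional controller.
    cos_th, sin_th, th_dot = obs
    return np.clip([-(2.0 * sin_th + 0.5 * th_dot)], -2.0, 2.0)

def robustness(init_theta, init_theta_dot, horizon=100, theta_max=np.pi / 2):
    """Roll out from a chosen initial state and return the minimum margin of a
    toy safety property |theta| < theta_max (negative => failure trajectory)."""
    env.reset()
    env.unwrapped.state = np.array([init_theta, init_theta_dot])
    obs = np.array([np.cos(init_theta), np.sin(init_theta), init_theta_dot])
    margin = float("inf")
    for _ in range(horizon):
        obs, _, done, _ = env.step(policy(obs))
        theta = np.arctan2(obs[1], obs[0])
        margin = min(margin, theta_max - abs(theta))
        if done:
            break
    return float(margin)

# BO minimizes the robustness; any evaluation below zero is a counterexample,
# and the corresponding rollout can be stored as a failure trajectory.
result = gp_minimize(
    lambda x: robustness(*x),
    dimensions=[Real(-0.5, 0.5), Real(-1.0, 1.0)],  # initial-state ranges (assumed)
    n_calls=MAX_BUDGET,
    random_state=0,
)
print("counterexample found:", result.fun < 0, "| worst margin:", result.fun)
```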
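The quoted experiment setup also states that the evaluation of the property ϕ_val_i is added with importance β to the reward of a failure trajectory ξ_f_i while training the corrective policy π_c, with the advantage factor A_t set to 1. A minimal sketch of such a shaped objective, with an illustrative β value and a standard PPO clipped term, is shown below; the concrete numbers and function names are assumptions, not the paper's implementation.

```python
import numpy as np

BETA = 0.5       # importance of the property evaluation (value is an assumption)
ADVANTAGE = 1.0  # advantage factor A_t fixed to 1, as stated in the setup

def shaped_return(rewards, phi_vals, beta=BETA):
    """Return of a failure trajectory xi_f augmented with the beta-weighted
    evaluation of the safety property along the same trajectory."""
    return float(np.sum(rewards) + beta * np.sum(phi_vals))

def clipped_ppo_term(ratio, eps=0.2, advantage=ADVANTAGE):
    """Standard PPO clipped surrogate term, here with the constant advantage A_t = 1."""
    return min(ratio * advantage,
               float(np.clip(ratio, 1.0 - eps, 1.0 + eps)) * advantage)

# Example: per-step task rewards and property margins from a short failure trajectory.
print(shaped_return(rewards=[-1.0, -0.5, -2.0], phi_vals=[0.2, -0.1, -0.4]))
print(clipped_ppo_term(ratio=1.3))
```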