Verifiable Reinforcement Learning via Policy Extraction
Authors: Osbert Bastani, Yewen Pu, Armando Solar-Lezama
NeurIPS 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We propose VIPER, an algorithm that combines ideas from model compression and imitation learning to learn decision tree policies guided by a DNN policy (called the oracle) and its Q-function, and show that it substantially outperforms two baselines. We use VIPER to (i) learn a provably robust decision tree policy for a variant of Atari Pong with a symbolic state space, (ii) learn a decision tree policy for a toy game based on Pong that provably never loses, and (iii) learn a provably stable decision tree policy for cart-pole. In each case, the decision tree policy achieves performance equal to that of the original DNN policy. |
| Researcher Affiliation | Academia | Osbert Bastani MIT obastani@csail.mit.edu Yewen Pu MIT yewenpu@mit.edu Armando Solar-Lezama MIT asolar@csail.mit.edu |
| Pseudocode | Yes | Algorithm 1 Decision tree policy extraction. |
| Open Source Code | No | The paper does not provide an explicit statement about releasing source code or a direct link to a code repository for the methodology described. |
| Open Datasets | Yes | For the Atari Pong environment, we use a 7-dimensional state space (extracted from raw images)... On the cart-pole benchmark... half-cheetah problem instance in the MuJoCo benchmark. |
| Dataset Splits | No | The paper does not explicitly specify exact percentages or sample counts for training, validation, and test splits, nor does it reference predefined splits with citations for reproducibility. |
| Hardware Specification | Yes | We compute the robustness ε(s0; π) at 5 random states s0 ∈ S, which took just under 2.9 seconds for each point (on a 2.5 GHz Intel Core i7 CPU) |
| Software Dependencies | No | The paper mentions specific tools like 'Z3' and 'Reluplex' but does not provide their version numbers, which are needed for reproducible software dependencies. |
| Experiment Setup | Yes | VIPER (with N = 80 iterations and M = 10 sampled traces per iteration) extracts a decision tree policy with 769 nodes that also achieves perfect reward 21.0. |
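
The Research Type and Pseudocode rows above describe Algorithm 1, a DAgger-style loop that extracts a decision tree policy guided by a DNN oracle and its Q-function. Below is a minimal Python sketch of that loop, not the paper's implementation: it assumes an old-style Gym environment (`reset()` returns the state, `step()` returns a 4-tuple) and a hypothetical `oracle` object exposing `predict(state)` and `q_values(state)`; the defaults mirror the Experiment Setup row (N = 80 iterations, M = 10 traces per iteration), and Q-gap `sample_weight`s stand in for the paper's weighted resampling of the aggregated dataset.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def extract_viper_policy(env, oracle, n_iters=80, m_traces=10, max_depth=None):
    """Sketch of VIPER-style decision tree extraction (hypothetical helper, not from the paper)."""
    states, actions, weights = [], [], []
    policy = None  # the first iteration rolls out the oracle itself

    for _ in range(n_iters):
        # 1. Sample M traces with the current policy (oracle on the first iteration).
        for _ in range(m_traces):
            state, done = env.reset(), False
            while not done:
                action = (oracle.predict(state) if policy is None
                          else policy.predict(np.asarray(state).reshape(1, -1))[0])
                # 2. Label every visited state with the oracle's greedy action, weighted by
                #    the Q-value gap: states where a wrong action is costly count more.
                q = oracle.q_values(state)
                states.append(state)
                actions.append(int(np.argmax(q)))
                weights.append(float(np.max(q) - np.min(q)))
                state, _, done, _ = env.step(action)

        # 3. Refit the decision tree on the aggregated, Q-weighted dataset.
        policy = DecisionTreeClassifier(max_depth=max_depth)
        policy.fit(np.asarray(states), np.asarray(actions),
                   sample_weight=np.asarray(weights))

    # The paper keeps the best tree found across iterations (selected by rollouts);
    # this sketch simply returns the last one.
    return policy
```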
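The Hardware Specification and Software Dependencies rows refer to verifying robustness ε(s0; π) with an SMT solver. The sketch below shows one plausible way such a check could be posed to Z3 for a decision tree policy; the tree encoding (nested `(feature, threshold, left, right)` tuples with integer action labels at the leaves) and the helper names are assumptions for illustration, not taken from the paper's artifacts.

```python
from z3 import Real, And, Or, Solver, unsat

def tree_action_constraint(tree, s, action_var):
    """Z3 constraint: state s reaches a leaf of the tree and action_var equals its label."""
    if isinstance(tree, int):          # leaf: the tree outputs this action
        return action_var == tree
    feat, thresh, left, right = tree   # internal node: branch on one feature
    return Or(And(s[feat] <= thresh, tree_action_constraint(left, s, action_var)),
              And(s[feat] > thresh, tree_action_constraint(right, s, action_var)))

def is_robust(tree, s0, eps, taken_action):
    """Check that no state in the L-infinity ball of radius eps around s0
    makes the tree choose an action other than taken_action."""
    s = [Real(f"s_{i}") for i in range(len(s0))]
    a = Real("a")
    solver = Solver()
    solver.add(*(And(s[i] >= s0[i] - eps, s[i] <= s0[i] + eps) for i in range(len(s0))))
    solver.add(tree_action_constraint(tree, s, a))
    solver.add(a != taken_action)      # search for a counterexample within the ball
    return solver.check() == unsat     # unsat => no counterexample => robust

# Toy usage: a depth-1 tree over a 2-D state that branches on feature 0.
toy_tree = (0, 0.0, 0, 1)
print(is_robust(toy_tree, s0=[0.5, 0.0], eps=0.1, taken_action=1))  # True
```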