Verifiable Reinforcement Learning via Policy Extraction

Authors: Osbert Bastani, Yewen Pu, Armando Solar-Lezama

NeurIPS 2018 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We propose VIPER, an algorithm that combines ideas from model compression and imitation learning to learn decision tree policies guided by a DNN policy (called the oracle) and its Q-function, and show that it substantially outperforms two baselines. We use VIPER to (i) learn a provably robust decision tree policy for a variant of Atari Pong with a symbolic state space, (ii) learn a decision tree policy for a toy game based on Pong that provably never loses, and (iii) learn a provably stable decision tree policy for cart-pole. In each case, the decision tree policy achieves performance equal to that of the original DNN policy.
Researcher Affiliation | Academia | Osbert Bastani, MIT, obastani@csail.mit.edu; Yewen Pu, MIT, yewenpu@mit.edu; Armando Solar-Lezama, MIT, asolar@csail.mit.edu
Pseudocode | Yes | Algorithm 1: Decision tree policy extraction.
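The extraction procedure named here (Algorithm 1) is a DAgger-style imitation-learning loop: in each of N iterations the current decision tree policy is rolled out, the visited states are labeled by the DNN oracle, states are resampled in proportion to how costly a wrong action would be according to the oracle's Q-function, a new tree is fit, and the best-performing tree is returned at the end. The following is a minimal sketch under assumed interfaces (an `oracle` exposing `act` and `q_values`, and an `env` whose `step` returns `(state, reward, done)`); it is an illustration, not the authors' released implementation.

```python
# Minimal sketch of VIPER-style decision tree policy extraction (Algorithm 1).
# `oracle` (DNN policy with a Q-function) and `env` are hypothetical interfaces;
# hyperparameters follow the paper's notation (N iterations, M traces).
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def rollout(env, policy_act, m_traces):
    """Collect states visited by `policy_act` over m_traces episodes."""
    states, total_reward = [], 0.0
    for _ in range(m_traces):
        s, done = env.reset(), False
        while not done:
            states.append(s)
            s, r, done = env.step(policy_act(s))
            total_reward += r
    return np.array(states), total_reward / m_traces

def viper_extract(env, oracle, n_iters=80, m_traces=10, max_leaves=256):
    dataset_states, policies = [], []
    student_act = oracle.act  # first iteration samples traces from the oracle itself
    for _ in range(n_iters):
        # 1. Sample traces with the current student policy, label states with the oracle.
        states, _ = rollout(env, student_act, m_traces)
        dataset_states.append(states)
        all_states = np.concatenate(dataset_states)
        labels = np.array([oracle.act(s) for s in all_states])

        # 2. Q-function-based resampling: states where a wrong action is costly
        #    (large gap between best and worst Q-value) are sampled more often.
        q = np.array([oracle.q_values(s) for s in all_states])
        weights = q.max(axis=1) - q.min(axis=1) + 1e-8
        idx = np.random.choice(len(all_states), size=len(all_states),
                               p=weights / weights.sum())

        # 3. Fit a decision tree policy on the resampled dataset.
        tree = DecisionTreeClassifier(max_leaf_nodes=max_leaves)
        tree.fit(all_states[idx], labels[idx])
        policies.append(tree)
        student_act = lambda s, t=tree: t.predict(s.reshape(1, -1))[0]

    # 4. Return the extracted tree with the best empirical reward.
    rewards = [rollout(env, lambda s, t=t: t.predict(s.reshape(1, -1))[0], m_traces)[1]
               for t in policies]
    return policies[int(np.argmax(rewards))]
```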
Open Source Code | No | The paper does not provide an explicit statement about releasing source code or a direct link to a code repository for the methodology described.
Open Datasets | Yes | For the Atari Pong environment, we use a 7-dimensional state space (extracted from raw images)... On the cart-pole benchmark... half-cheetah problem instance in the MuJoCo benchmark.
Dataset Splits | No | The paper does not explicitly specify exact percentages or sample counts for training, validation, and test splits, nor does it reference predefined splits with citations for reproducibility.
Hardware Specification | Yes | We compute the robustness ε(s0; π) at 5 random states s0 ∈ S, which took just under 2.9 seconds for each point (on a 2.5 GHz Intel Core i7 CPU)
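For an axis-aligned decision tree policy, the robustness ε(s0; π) quoted here, i.e. the largest ℓ∞ ball around s0 on which the predicted action is constant, can be computed exactly by measuring the distance from s0 to every leaf region labeled with a different action. The sketch below does this for a scikit-learn tree; it illustrates the quantity being reported, not the solver-based tooling (Z3, Reluplex) referenced in the paper.

```python
# Sketch: exact l-infinity robustness radius of an axis-aligned decision tree
# policy at a state s0, i.e. the largest eps such that every state within eps
# of s0 receives the same action. Illustrative only.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def tree_robustness(tree: DecisionTreeClassifier, s0: np.ndarray) -> float:
    t = tree.tree_
    # Class index of the leaf containing s0 (compare class indices throughout).
    s0_class = int(np.argmax(t.value[tree.apply(s0.reshape(1, -1))[0]]))
    best = np.inf  # distance to the nearest region with a different action

    def visit(node, lo, hi):
        nonlocal best
        if t.children_left[node] == -1:  # leaf node
            if int(np.argmax(t.value[node])) != s0_class:
                # l-infinity distance from s0 to the box [lo, hi]
                dist = np.maximum.reduce([lo - s0, s0 - hi, np.zeros_like(s0)]).max()
                best = min(best, dist)
            return
        f, thr = t.feature[node], t.threshold[node]
        hi_left = hi.copy(); hi_left[f] = min(hi[f], thr)   # left branch: x[f] <= thr
        lo_right = lo.copy(); lo_right[f] = max(lo[f], thr)  # right branch: x[f] > thr
        visit(t.children_left[node], lo, hi_left)
        visit(t.children_right[node], lo_right, hi)

    dim = s0.shape[0]
    visit(0, np.full(dim, -np.inf), np.full(dim, np.inf))
    return best
```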
Software Dependencies | No | The paper mentions specific tools like 'Z3' and 'Reluplex' but does not provide their version numbers, which are necessary for reproducible software dependencies.
Experiment Setup | Yes | VIPER (with N = 80 iterations and M = 10 sampled traces per iteration) extracts a decision tree policy with 769 nodes that also achieves perfect reward 21.0.
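As a usage note, the reported hyperparameters map directly onto the hypothetical `viper_extract` sketch above (`env` and `oracle` remain placeholders):

```python
# Hypothetical usage mirroring the reported setup: N = 80 iterations and
# M = 10 sampled traces per iteration.
best_tree = viper_extract(env, oracle, n_iters=80, m_traces=10)
print(best_tree.tree_.node_count)  # the paper reports a 769-node tree for Pong
```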