Verifiable Reinforcement Learning via Policy Extraction
Authors: Osbert Bastani, Yewen Pu, Armando Solar-Lezama
NeurIPS 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We propose VIPER, an algorithm that combines ideas from model compression and imitation learning to learn decision tree policies guided by a DNN policy (called the oracle) and its Q-function, and show that it substantially outperforms two baselines. We use VIPER to (i) learn a provably robust decision tree policy for a variant of Atari Pong with a symbolic state space, (ii) learn a decision tree policy for a toy game based on Pong that provably never loses, and (iii) learn a provably stable decision tree policy for cart-pole. In each case, the decision tree policy achieves performance equal to that of the original DNN policy. |
| Researcher Affiliation | Academia | Osbert Bastani MIT obastani@csail.mit.edu Yewen Pu MIT yewenpu@mit.edu Armando Solar-Lezama MIT asolar@csail.mit.edu |
| Pseudocode | Yes | Algorithm 1 Decision tree policy extraction. |
| Open Source Code | No | The paper does not provide an explicit statement about releasing source code or a direct link to a code repository for the methodology described. |
| Open Datasets | Yes | For the Atari Pong environment, we use a 7-dimensional state space (extracted from raw images)... On the cart-pole benchmark... half-cheetah problem instance in the MuJoCo benchmark. |
| Dataset Splits | No | The paper does not explicitly specify exact percentages or sample counts for training, validation, and test splits, nor does it reference predefined splits with citations for reproducibility. |
| Hardware Specification | Yes | We compute the robustness ε(s0; π) at 5 random states s0 ∈ S, which took just under 2.9 seconds for each point (on a 2.5 GHz Intel Core i7 CPU) |
| Software Dependencies | No | The paper mentions specific tools like 'Z3' and 'Reluplex' but does not provide their version numbers, which are needed for reproducible software dependencies. |
| Experiment Setup | Yes | VIPER (with N = 80 iterations and M = 10 sampled traces per iteration) extracts a decision tree policy with 769 nodes that also achieves perfect reward 21.0. |
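
The Research Type and Pseudocode rows above describe Algorithm 1, a DAgger-style loop that extracts a decision tree policy guided by a DNN oracle and its Q-function. Below is a minimal Python sketch of that loop, not the paper's implementation: it assumes an old-style Gym environment (`reset()` returns the state, `step()` returns a 4-tuple) and a hypothetical `oracle` object exposing `predict(state)` and `q_values(state)`; the defaults mirror the Experiment Setup row (N = 80 iterations, M = 10 traces per iteration), and Q-gap `sample_weight`s stand in for the paper's weighted resampling of the aggregated dataset.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def extract_viper_policy(env, oracle, n_iters=80, m_traces=10, max_depth=None):
    """Sketch of VIPER-style decision tree extraction (hypothetical helper, not from the paper)."""
    states, actions, weights = [], [], []
    policy = None  # the first iteration rolls out the oracle itself

    for _ in range(n_iters):
        # 1. Sample M traces with the current policy (oracle on the first iteration).
        for _ in range(m_traces):
            state, done = env.reset(), False
            while not done:
                action = (oracle.predict(state) if policy is None
                          else policy.predict(np.asarray(state).reshape(1, -1))[0])
                # 2. Label every visited state with the oracle's greedy action, weighted by
                #    the Q-value gap: states where a wrong action is costly count more.
                q = oracle.q_values(state)
                states.append(state)
                actions.append(int(np.argmax(q)))
                weights.append(float(np.max(q) - np.min(q)))
                state, _, done, _ = env.step(action)

        # 3. Refit the decision tree on the aggregated, Q-weighted dataset.
        policy = DecisionTreeClassifier(max_depth=max_depth)
        policy.fit(np.asarray(states), np.asarray(actions),
                   sample_weight=np.asarray(weights))

    # The paper keeps the best tree found across iterations (selected by rollouts);
    # this sketch simply returns the last one.
    return policy
```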
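The Hardware Specification and Software Dependencies rows refer to verifying robustness ε(s0; π) with an SMT solver. The sketch below shows one plausible way such a check could be posed to Z3 for a decision tree policy; the tree encoding (nested `(feature, threshold, left, right)` tuples with integer action labels at the leaves) and the helper names are assumptions for illustration, not taken from the paper's artifacts.

```python
from z3 import Real, And, Or, Solver, unsat

def tree_action_constraint(tree, s, action_var):
    """Z3 constraint: state s reaches a leaf of the tree and action_var equals its label."""
    if isinstance(tree, int):          # leaf: the tree outputs this action
        return action_var == tree
    feat, thresh, left, right = tree   # internal node: branch on one feature
    return Or(And(s[feat] <= thresh, tree_action_constraint(left, s, action_var)),
              And(s[feat] > thresh, tree_action_constraint(right, s, action_var)))

def is_robust(tree, s0, eps, taken_action):
    """Check that no state in the L-infinity ball of radius eps around s0
    makes the tree choose an action other than taken_action."""
    s = [Real(f"s_{i}") for i in range(len(s0))]
    a = Real("a")
    solver = Solver()
    solver.add(*(And(s[i] >= s0[i] - eps, s[i] <= s0[i] + eps) for i in range(len(s0))))
    solver.add(tree_action_constraint(tree, s, a))
    solver.add(a != taken_action)      # search for a counterexample within the ball
    return solver.check() == unsat     # unsat => no counterexample => robust

# Toy usage: a depth-1 tree over a 2-D state that branches on feature 0.
toy_tree = (0, 0.0, 0, 1)
print(is_robust(toy_tree, s0=[0.5, 0.0], eps=0.1, taken_action=1))  # True
```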