Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Verifiable Reinforcement Learning via Policy Extraction
Authors: Osbert Bastani, Yewen Pu, Armando Solar-Lezama
NeurIPS 2018 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We propose VIPER, an algorithm that combines ideas from model compression and imitation learning to learn decision tree policies guided by a DNN policy (called the oracle) and its Q-function, and show that it substantially outperforms two baselines. We use VIPER to (i) learn a provably robust decision tree policy for a variant of Atari Pong with a symbolic state space, (ii) learn a decision tree policy for a toy game based on Pong that provably never loses, and (iii) learn a provably stable decision tree policy for cart-pole. In each case, the decision tree policy achieves performance equal to that of the original DNN policy. |
| Researcher Affiliation | Academia | Osbert Bastani (MIT), Yewen Pu (MIT), Armando Solar-Lezama (MIT) |
| Pseudocode | Yes | Algorithm 1 Decision tree policy extraction. |
| Open Source Code | No | The paper does not provide an explicit statement about releasing source code or a direct link to a code repository for the methodology described. |
| Open Datasets | Yes | For the Atari Pong environment, we use a 7-dimensional state space (extracted from raw images)... On the cart-pole benchmark... half-cheetah problem instance in the MuJoCo benchmark. |
| Dataset Splits | No | The paper does not explicitly specify exact percentages or sample counts for training, validation, and test splits, nor does it reference predefined splits with citations for reproducibility. |
| Hardware Specification | Yes | We compute the robustness at 5 random states s0 ∈ S, which took just under 2.9 seconds for each point (on a 2.5 GHz Intel Core i7 CPU) |
| Software Dependencies | No | The paper mentions specific tools like 'Z3' and 'Reluplex' but does not provide their version numbers, which is necessary for reproducible software dependencies. |
| Experiment Setup | Yes | VIPER (with N = 80 iterations and M = 10 sampled traces per iteration) extracts a decision tree policy with 769 nodes that also achieves perfect reward 21.0. |