Policy Certificates: Towards Accountable Reinforcement Learning

Authors: Christoph Dann, Lihong Li, Wei Wei, Emma Brunskill

ICML 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | One important use case for certificates is to detect sudden performance drops when the distribution of contexts changes. For example, in a call-center dialogue system, there can be a sudden increase in customer calls due to a regional outage. We demonstrate that certificates can identify such performance drops caused by context shifts. We consider a simulated MDP with 10 states, 40 actions, and horizon 5 where rewards depend on a 10-dimensional context, and we let the distribution of contexts change after 2M episodes. As seen in Figure 1, this causes a spike in the optimality gap as well as in the optimality certificates. (A simulation sketch of this setup is given after the table.)
Researcher Affiliation | Collaboration | Christoph Dann (Carnegie Mellon University), Lihong Li (Google Research), Wei Wei (Google Research), Emma Brunskill (Stanford University).
Pseudocode | Yes | Algorithm 1: ORLC (Optimistic Reinforcement Learning with Certificates) ... Algorithm 2: ORLC-SI (Optimistic Reinforcement Learning with Certificates and Side Information). (A simplified sketch of the ORLC episode loop is given after the table.)
Open Source Code | No | The paper does not provide an explicit statement of, or link to, open-source code for the described methodology.
Open Datasets | No | The paper describes using a simulated MDP for its experiments (Section 5) rather than a publicly available dataset with concrete access information. While it describes the characteristics of the simulated environment, it does not provide access details for the data itself.
Dataset Splits | No | The paper studies episodic RL on a simulated MDP, which involves interactive data collection rather than static dataset splits. It does not report explicit train/validation/test splits in terms of percentages or sample counts, as is typical for supervised learning setups.
Hardware Specification | No | The paper does not specify any hardware details (e.g., GPU/CPU models, memory) used to run its experiments.
Software Dependencies | No | The paper does not specify any software dependencies with version numbers (e.g., specific libraries, frameworks, or solvers).
Experiment Setup | Yes | Input: failure tolerance δ ∈ (0, 1] ... Input: failure prob. δ ∈ (0, 1], regularizer λ > 0, d; ... We consider a simulated MDP with 10 states, 40 actions, and horizon 5 where rewards depend on a 10-dimensional context, and we let the distribution of contexts change after 2M episodes.
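To make the pseudocode row concrete, below is a minimal tabular sketch of one optimistic-RL-with-certificates episode in the spirit of Algorithm 1 (ORLC). It is an illustrative reconstruction, not the paper's implementation: the Hoeffding-style bonus, the clipping to [0, H], and the helper names (`orlc_episode`, `s0_sampler`, `env_step`) are simplifying assumptions, and the paper's tighter confidence bounds are not reproduced.

```python
import numpy as np

def orlc_episode(counts, rew_sum, trans_counts, H, S, A,
                 s0_sampler, env_step, delta=0.05):
    """One episode of optimistic RL with certificates (simplified sketch)."""
    # Empirical model; a generic Hoeffding-style bonus stands in for the
    # paper's tighter confidence bounds (assumption).
    n = np.maximum(counts, 1)                       # visit counts N(s, a)
    r_hat = rew_sum / n                             # empirical mean rewards
    p_hat = trans_counts / n[..., None]             # empirical transitions
    bonus = H * np.sqrt(np.log(2 * S * A * H / delta) / (2 * n))

    # Optimistic (upper) and pessimistic (lower) value functions by
    # backward induction over the horizon.
    V_up = np.zeros((H + 1, S))
    V_lo = np.zeros((H + 1, S))
    pi = np.zeros((H, S), dtype=int)
    for h in reversed(range(H)):
        q_up = r_hat + bonus + p_hat @ V_up[h + 1]
        q_lo = r_hat - bonus + p_hat @ V_lo[h + 1]
        pi[h] = np.argmax(q_up, axis=1)             # greedy w.r.t. optimistic Q
        V_up[h] = np.clip(q_up[np.arange(S), pi[h]], 0.0, H)
        V_lo[h] = np.clip(q_lo[np.arange(S), pi[h]], 0.0, H)

    # The certificate is announced before the episode is played: it bounds
    # the optimality gap of the policy about to be executed.
    s = s0_sampler()
    certificate = V_up[0, s] - V_lo[0, s]

    # Execute the policy and update the model statistics.
    for h in range(H):
        a = pi[h, s]
        s_next, r = env_step(s, a)
        counts[s, a] += 1
        rew_sum[s, a] += r
        trans_counts[s, a, s_next] += 1
        s = s_next
    return certificate, pi
```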
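And here is a sketch of the context-shift experiment summarized in the Research Type and Experiment Setup rows: a simulated MDP with 10 states, 40 actions, and horizon 5 whose rewards depend on a 10-dimensional context, with the context distribution changed after a fixed number of episodes. The reward model, the transition kernel, and the use of the tabular `orlc_episode` helper above (the paper's ORLC-SI additionally exploits the context as side information) are illustrative assumptions; only the problem dimensions and the shift itself come from the paper.

```python
import numpy as np

S, A, H, D = 10, 40, 5, 10            # states, actions, horizon, context dimension
SHIFT_AT = 2_000_000                  # episode at which the context distribution changes
                                      # (use a much smaller number to test quickly)

rng = np.random.default_rng(0)
W = rng.normal(size=(S, A, D))                # hypothetical context-to-reward weights
P = rng.dirichlet(np.ones(S), size=(S, A))    # fixed transition kernel

def sample_context(episode):
    # The context distribution shifts after SHIFT_AT episodes
    # (e.g., a sudden increase in calls caused by a regional outage).
    mean = np.zeros(D) if episode < SHIFT_AT else np.full(D, 2.0)
    return rng.normal(loc=mean, scale=1.0)

def run(num_episodes):
    counts = np.zeros((S, A))
    rew_sum = np.zeros((S, A))
    trans_counts = np.zeros((S, A, S))
    certificates = []
    for k in range(num_episodes):
        ctx = sample_context(k)
        r_ctx = 1.0 / (1.0 + np.exp(-(W @ ctx)))   # context-dependent rewards in (0, 1)

        def env_step(s, a):
            s_next = rng.choice(S, p=P[s, a])
            return s_next, r_ctx[s, a]

        cert, _ = orlc_episode(counts, rew_sum, trans_counts, H, S, A,
                               s0_sampler=lambda: 0, env_step=env_step)
        certificates.append(cert)
    # Plotting the certificates over episodes would show a spike after SHIFT_AT,
    # flagging the performance drop caused by the context shift.
    return certificates
```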