Policy Certificates: Towards Accountable Reinforcement Learning
Authors: Christoph Dann, Lihong Li, Wei Wei, Emma Brunskill
ICML 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | One important use case for certificates is to detect sudden performance drops when the distribution of contexts changes. For example, in a call center dialogue system, there can be a sudden increase in customers calling due to a regional outage. We demonstrate that certificates can identify such performance drops caused by context shifts. We consider a simulated MDP with 10 states, 40 actions, and horizon 5 where rewards depend on a 10-dimensional context, and let the distribution of contexts change after 2M episodes. As seen in Figure 1, this causes a spike in the optimality gap as well as in the optimality certificates. (A hedged code sketch of such a context-shift environment appears below the table.) |
| Researcher Affiliation | Collaboration | Christoph Dann (Carnegie Mellon University), Lihong Li (Google Research), Wei Wei (Google Research), Emma Brunskill (Stanford University). |
| Pseudocode | Yes | Algorithm 1: ORLC (Optimistic Reinforcement Learning with Certificates) ... Algorithm 2: ORLC-SI (Optimistic Reinforcement Learning with Certificates and Side Information). (A toy Python sketch of the certificate loop appears below the table.) |
| Open Source Code | No | The paper does not provide an explicit statement or link to the open-source code for the described methodology. |
| Open Datasets | No | The paper describes using a "simulated MDP" for experiments (Section 5) rather than a publicly available dataset with concrete access information. While it describes the characteristics of the simulated environment, it does not provide access details for the data itself. |
| Dataset Splits | No | The paper discusses episodic RL and uses a simulated MDP, which involves interactive data collection rather than static dataset splits. It does not provide explicit train/validation/test dataset splits in terms of percentages or sample counts, which are typical for supervised learning setups. |
| Hardware Specification | No | The paper does not specify any hardware details (e.g., GPU/CPU models, memory) used for running its experiments. |
| Software Dependencies | No | The paper does not specify any software dependencies with version numbers (e.g., specific libraries, frameworks, or solvers with their versions). |
| Experiment Setup | Yes | Input: failure tolerance δ ∈ (0, 1] ... Input: failure prob. δ ∈ (0, 1], regularizer λ > 0, d; ... We consider a simulated MDP with 10 states, 40 actions, and horizon 5 where rewards depend on a 10-dimensional context, and let the distribution of contexts change after 2M episodes. |
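
To make the Pseudocode row concrete, below is a minimal tabular sketch of the certificate idea behind Algorithm 1 (ORLC): each episode, the agent computes optimistic upper bounds and pessimistic lower bounds on the value of its greedy policy via backward induction, acts greedily with respect to the upper bound, and reports the gap between the two bounds at the initial state as that episode's certificate. This sketch uses crude Hoeffding-style bonuses rather than the paper's tighter confidence bounds, and the `env` interface (`reset`/`step`), function name, and parameters are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def orlc_sketch(env, S, A, H, num_episodes, delta=0.1):
    """Minimal tabular sketch of optimistic RL with per-episode certificates.

    Hoeffding-style bonuses stand in for the paper's tighter bounds; all
    variable names are illustrative only.
    """
    counts = np.zeros((H, S, A))           # visit counts per (step, state, action)
    reward_sum = np.zeros((H, S, A))        # accumulated rewards
    trans_counts = np.zeros((H, S, A, S))   # transition counts
    certificates = []

    for k in range(num_episodes):
        # Empirical model with a uniform fallback for unvisited (h, s, a).
        n = np.maximum(counts, 1)
        r_hat = reward_sum / n
        p_hat = trans_counts / n[..., None]
        p_hat[counts == 0] = 1.0 / S
        # Crude confidence bonus on the value scale.
        bonus = H * np.sqrt(np.log(2 * S * A * H * (k + 1) / delta) / (2 * n))

        # Backward induction: optimistic upper and pessimistic lower value bounds.
        V_up = np.zeros((H + 1, S))
        V_lo = np.zeros((H + 1, S))
        pi = np.zeros((H, S), dtype=int)
        for h in reversed(range(H)):
            q_up = np.clip(r_hat[h] + bonus[h] + p_hat[h] @ V_up[h + 1], 0.0, H - h)
            q_lo = np.clip(r_hat[h] - bonus[h] + p_hat[h] @ V_lo[h + 1], 0.0, H - h)
            pi[h] = q_up.argmax(axis=1)
            V_up[h] = q_up.max(axis=1)
            V_lo[h] = q_lo[np.arange(S), pi[h]]

        # The certificate upper-bounds the optimality gap of this episode's policy.
        s = env.reset()
        certificates.append(V_up[0, s] - V_lo[0, s])

        # Execute the optimistic greedy policy and update the model statistics.
        for h in range(H):
            a = pi[h, s]
            s_next, r, done = env.step(a)
            counts[h, s, a] += 1
            reward_sum[h, s, a] += r
            trans_counts[h, s, a, s_next] += 1
            s = s_next
    return certificates
```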
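The Experiment Setup row can likewise be illustrated with a hypothetical stand-in for the simulated context-shift environment (10 states, 40 actions, horizon 5, rewards depending on a 10-dimensional context, context distribution shift after 2M episodes). The transition and reward structure below is assumed for illustration; the paper does not release its environment, which is also why the Open Source Code and Open Datasets rows are marked "No".

```python
import numpy as np


class ContextShiftMDP:
    """Hypothetical stand-in for the paper's simulated environment: 10 states,
    40 actions, horizon 5, rewards depending on a 10-dimensional context, with
    the context distribution shifting after a fixed number of episodes
    (2M in the paper). The reward/transition structure is assumed, not taken
    from the paper."""

    def __init__(self, n_states=10, n_actions=40, horizon=5, d_context=10,
                 shift_episode=2_000_000, seed=0):
        self.rng = np.random.default_rng(seed)
        self.S, self.A, self.H, self.d = n_states, n_actions, horizon, d_context
        self.shift_episode = shift_episode
        # Fixed transition kernel and reward weights (illustrative choices).
        self.P = self.rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
        self.W = self.rng.uniform(0.0, 1.0, size=(n_states, n_actions, d_context))
        # Two mean context vectors: before and after the distribution shift.
        self.mu_before = self.rng.uniform(0.0, 1.0, size=d_context)
        self.mu_after = self.rng.uniform(0.0, 1.0, size=d_context)
        self.episode = 0

    def reset(self):
        mu = self.mu_before if self.episode < self.shift_episode else self.mu_after
        ctx = np.clip(self.rng.normal(mu, 0.1), 0.0, None)
        self.context = ctx / max(ctx.sum(), 1e-8)  # normalize so rewards stay in [0, 1]
        self.state, self.h = 0, 0
        self.episode += 1
        return self.state

    def step(self, action):
        reward = float(self.W[self.state, action] @ self.context)
        self.state = int(self.rng.choice(self.S, p=self.P[self.state, action]))
        self.h += 1
        done = self.h >= self.H
        return self.state, reward, done
```

Note that in the paper the context is handled by ORLC-SI through linear side information; the tabular `orlc_sketch` above ignores the context and is only meant to show how per-episode certificates are formed, not to reproduce the Figure 1 experiment.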