Policy Certificates: Towards Accountable Reinforcement Learning

Authors: Christoph Dann, Lihong Li, Wei Wei, Emma Brunskill

ICML 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | One important use case for certificates is to detect sudden performance drops when the distribution of contexts changes. For example, in a call-center dialogue system, there can be a sudden increase in customer calls due to a regional outage. We demonstrate that certificates can identify such performance drops caused by context shifts. We consider a simulated MDP with 10 states, 40 actions, and horizon 5 where rewards depend on a 10-dimensional context, and we let the distribution of contexts change after 2M episodes. As seen in Figure 1, this causes a spike in the optimality gap as well as in the optimality certificates. (A simulation sketch of this setup is given after the table.)
Researcher Affiliation | Collaboration | Christoph Dann (Carnegie Mellon University), Lihong Li (Google Research), Wei Wei (Google Research), Emma Brunskill (Stanford University).
Pseudocode | Yes | Algorithm 1: ORLC (Optimistic Reinforcement Learning with Certificates) ... Algorithm 2: ORLC-SI (Optimistic Reinforcement Learning with Certificates and Side Information). (A simplified sketch of the ORLC episode loop is given after the table.)
Open Source Code | No | The paper does not provide an explicit statement of, or link to, open-source code for the described methodology.
Open Datasets | No | The paper describes using a simulated MDP for its experiments (Section 5) rather than a publicly available dataset with concrete access information. While it describes the characteristics of the simulated environment, it does not provide access details for the data itself.
Dataset Splits | No | The paper studies episodic RL on a simulated MDP, which involves interactive data collection rather than static dataset splits. It does not report explicit train/validation/test splits in terms of percentages or sample counts, as is typical for supervised learning setups.
Hardware Specification | No | The paper does not specify any hardware details (e.g., GPU/CPU models, memory) used to run its experiments.
Software Dependencies | No | The paper does not specify any software dependencies with version numbers (e.g., specific libraries, frameworks, or solvers).
Experiment Setup | Yes | Input: failure tolerance δ ∈ (0, 1] ... Input: failure prob. δ ∈ (0, 1], regularizer λ > 0, d; ... We consider a simulated MDP with 10 states, 40 actions, and horizon 5 where rewards depend on a 10-dimensional context, and we let the distribution of contexts change after 2M episodes.
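To make the pseudocode row concrete, below is a minimal tabular sketch of one optimistic-RL-with-certificates episode in the spirit of Algorithm 1 (ORLC). It is an illustrative reconstruction, not the paper's implementation: the Hoeffding-style bonus, the clipping to [0, H], and the helper names (`orlc_episode`, `s0_sampler`, `env_step`) are simplifying assumptions, and the paper's tighter confidence bounds are not reproduced.

```python
import numpy as np

def orlc_episode(counts, rew_sum, trans_counts, H, S, A,
                 s0_sampler, env_step, delta=0.05):
    """One episode of optimistic RL with certificates (simplified sketch)."""
    # Empirical model; a generic Hoeffding-style bonus stands in for the
    # paper's tighter confidence bounds (assumption).
    n = np.maximum(counts, 1)                       # visit counts N(s, a)
    r_hat = rew_sum / n                             # empirical mean rewards
    p_hat = trans_counts / n[..., None]             # empirical transitions
    bonus = H * np.sqrt(np.log(2 * S * A * H / delta) / (2 * n))

    # Optimistic (upper) and pessimistic (lower) value functions by
    # backward induction over the horizon.
    V_up = np.zeros((H + 1, S))
    V_lo = np.zeros((H + 1, S))
    pi = np.zeros((H, S), dtype=int)
    for h in reversed(range(H)):
        q_up = r_hat + bonus + p_hat @ V_up[h + 1]
        q_lo = r_hat - bonus + p_hat @ V_lo[h + 1]
        pi[h] = np.argmax(q_up, axis=1)             # greedy w.r.t. optimistic Q
        V_up[h] = np.clip(q_up[np.arange(S), pi[h]], 0.0, H)
        V_lo[h] = np.clip(q_lo[np.arange(S), pi[h]], 0.0, H)

    # The certificate is announced before the episode is played: it bounds
    # the optimality gap of the policy about to be executed.
    s = s0_sampler()
    certificate = V_up[0, s] - V_lo[0, s]

    # Execute the policy and update the model statistics.
    for h in range(H):
        a = pi[h, s]
        s_next, r = env_step(s, a)
        counts[s, a] += 1
        rew_sum[s, a] += r
        trans_counts[s, a, s_next] += 1
        s = s_next
    return certificate, pi
```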
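And here is a sketch of the context-shift experiment summarized in the Research Type and Experiment Setup rows: a simulated MDP with 10 states, 40 actions, and horizon 5 whose rewards depend on a 10-dimensional context, with the context distribution changed after a fixed number of episodes. The reward model, the transition kernel, and the use of the tabular `orlc_episode` helper above (the paper's ORLC-SI additionally exploits the context as side information) are illustrative assumptions; only the problem dimensions and the shift itself come from the paper.

```python
import numpy as np

S, A, H, D = 10, 40, 5, 10            # states, actions, horizon, context dimension
SHIFT_AT = 2_000_000                  # episode at which the context distribution changes
                                      # (use a much smaller number to test quickly)

rng = np.random.default_rng(0)
W = rng.normal(size=(S, A, D))                # hypothetical context-to-reward weights
P = rng.dirichlet(np.ones(S), size=(S, A))    # fixed transition kernel

def sample_context(episode):
    # The context distribution shifts after SHIFT_AT episodes
    # (e.g., a sudden increase in calls caused by a regional outage).
    mean = np.zeros(D) if episode < SHIFT_AT else np.full(D, 2.0)
    return rng.normal(loc=mean, scale=1.0)

def run(num_episodes):
    counts = np.zeros((S, A))
    rew_sum = np.zeros((S, A))
    trans_counts = np.zeros((S, A, S))
    certificates = []
    for k in range(num_episodes):
        ctx = sample_context(k)
        r_ctx = 1.0 / (1.0 + np.exp(-(W @ ctx)))   # context-dependent rewards in (0, 1)

        def env_step(s, a):
            s_next = rng.choice(S, p=P[s, a])
            return s_next, r_ctx[s, a]

        cert, _ = orlc_episode(counts, rew_sum, trans_counts, H, S, A,
                               s0_sampler=lambda: 0, env_step=env_step)
        certificates.append(cert)
    # Plotting the certificates over episodes would show a spike after SHIFT_AT,
    # flagging the performance drop caused by the context shift.
    return certificates
```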