Model-Free and Model-Based Policy Evaluation when Causality is Uncertain

Authors: David A. Bruns-Smith

ICML 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate our methods with existing OPE benchmarks, using the benchmarks from OPE-Tools (Voloshin et al., 2019). In particular, we adapt their three discrete environments, Graph, Discrete MC, and Gridworld, together with a small toy problem. Our lower bounds for the four environments are plotted in Figure 1. The gap between the candidate MDP value and our lower bounds is reported in Table 2. (A sketch of this gap computation appears in the first code block below the table.)
Researcher Affiliation | Academia | David Bruns-Smith, Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, USA. Correspondence to: David Bruns-Smith <bruns-smith@berkeley.edu>.
Pseudocode | No | The paper describes its algorithmic procedures textually and mathematically but does not include explicit pseudocode blocks or algorithms labeled as such.
Open Source Code | No | The paper does not contain any statement about releasing source code for the described methodology, nor does it provide a link to a code repository.
Open Datasets | Yes | We use the benchmarks from OPE-Tools (Voloshin et al., 2019) for evaluation. In particular, we adapt their three discrete environments, Graph, Discrete MC, and Gridworld, together with a small toy problem.
Dataset Splits | No | The paper mentions collecting trajectories for its experiments ('For our first experiment, we collect trajectories from each of the four environments using their respective behavior policies. For each environment, we collect 30,000/horizon trajectories'), but it does not specify any training, validation, or test dataset splits. (The 30,000/horizon bookkeeping is illustrated in the second code block below the table.)
Hardware Specification | No | The paper does not provide any specific details about the hardware (e.g., GPU models, CPU types, memory) used to run the experiments.
Software Dependencies | No | The paper mentions using 'OPE-Tools (Voloshin et al., 2019)' as a benchmark suite but does not specify any software names with version numbers for libraries, frameworks, or other dependencies.
Experiment Setup | Yes | For our first experiment, we collect trajectories from each of the four environments using their respective behavior policies. For each environment, we collect 30,000/horizon trajectories, keeping the number of data points the same across environments. For the robust MDP bounds, we fix the parameter p = 0.5, i.e., each period the unobserved state is equally likely to be u = 0 or u = 1. We produce bounds for Γ = 1.5, 2, and 10 using our robust MDP method with set to 1,000,000. We also adopt a discount rate of γ = 0.95 so that T = 200 is well beyond the effective horizon. (The effective-horizon claim is checked in the third code block below the table.)
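
The three sketches below are illustrative only: the paper releases no code, so every function name, signature, and default value is an assumption of this write-up, not the author's implementation. First, the kind of quantity reported in the paper's Table 2: a discounted Monte Carlo value estimate for a policy, and its gap from a robust lower bound.

```python
import numpy as np

def mc_value(trajectories, gamma=0.95):
    """Discounted Monte Carlo estimate of a policy's value.

    `trajectories` is a list of per-episode reward sequences.
    """
    returns = [
        sum(gamma**t * r for t, r in enumerate(rewards))
        for rewards in trajectories
    ]
    return float(np.mean(returns))

def lower_bound_gap(candidate_value, lower_bound):
    """Gap between the candidate MDP value and a robust lower bound
    (the quantity the paper tabulates in Table 2)."""
    return candidate_value - lower_bound
```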
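Second, the '30,000/horizon' rule from the Dataset Splits and Experiment Setup rows keeps the total number of transitions fixed across environments, so the trajectory count scales inversely with the horizon. The horizon values in the loop are hypothetical placeholders; the quoted passages do not list the per-environment horizons.

```python
TOTAL_TRANSITIONS = 30_000  # fixed data budget per environment (from the paper)

def num_trajectories(horizon: int) -> int:
    """Trajectories to collect so that trajectories * horizon ~= 30,000."""
    return TOTAL_TRANSITIONS // horizon

# Hypothetical horizons, for illustration only:
for h in (4, 20, 100):
    print(f"horizon={h:3d} -> {num_trajectories(h):5d} trajectories")
```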
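Third, a quick check that T = 200 is well beyond the effective horizon at γ = 0.95: the effective horizon is roughly 1/(1 - γ) = 20 steps, and the fraction of total discounted weight lying beyond step T is γ^T ≈ 3.5e-5. Only γ, T, p, and the Γ grid come from the quoted setup; the rest is standard discounting arithmetic.

```python
gamma = 0.95                 # discount rate (from the paper)
T = 200                      # truncation horizon (from the paper)
p = 0.5                      # P(u = 1) each period (from the paper)
Gammas = (1.5, 2.0, 10.0)    # sensitivity parameters for the robust bounds

effective_horizon = 1.0 / (1.0 - gamma)  # ~20 steps
tail_fraction = gamma**T                 # share of discounted weight beyond T
print(f"effective horizon ~ {effective_horizon:.0f} steps")
print(f"fraction of weight beyond T={T}: {tail_fraction:.1e}")  # ~3.5e-05
```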