Representation Balancing MDPs for Off-policy Policy Evaluation

Authors: Yao Liu, Omer Gottesman, Aniruddh Raghu, Matthieu Komorowski, Aldo A. Faisal, Finale Doshi-Velez, Emma Brunskill

NeurIPS 2018 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate that our resulting models can yield substantially lower mean squared error estimators than prior model-based and IS-based estimators on a classic benchmark RL task (even when the IS-based estimators are given access to the true behavior policy). We also demonstrate our approach can yield improved results on a HIV treatment simulator [6].
Researcher Affiliation | Academia | Yao Liu, Stanford University (yaoliu@stanford.edu); Omer Gottesman, Harvard University (gottesman@fas.harvard.edu); Aniruddh Raghu, Cambridge University (aniruddhraghu@gmail.com); Matthieu Komorowski, Imperial College London (matthieu.komorowski@gmail.com); Aldo Faisal, Imperial College London (a.faisal@imperial.ac.uk); Finale Doshi-Velez, Harvard University (finale@seas.harvard.edu); Emma Brunskill, Stanford University (ebrun@cs.stanford.edu)
Pseudocode | No | The paper describes the algorithm and objective function, but does not include a formal pseudocode block or algorithm figure.
Open Source Code | No | The paper does not state that its source code is publicly available.
Open Datasets | Yes | We test our algorithm on two continuous-state benchmark domains. We use a greedy policy from a learned Q function as the evaluation policy, and an ε-greedy policy with ε = 0.2 as the behavior policy. We collect 1024 trajectories for OPPE. In the Cart Pole domain the average trajectory length is around 190 (long-horizon variant) or around 23 (short-horizon variant); in Mountain Car it is around 150. The HIV simulator is described in Ernst et al. [6]. (See the data-collection sketch after this table.)
Dataset Splits | No | The paper describes data collection from simulation environments but does not specify explicit train/validation/test splits.
Hardware Specification | No | The paper does not describe the hardware used to run the experiments.
Software Dependencies | No | The paper does not give version numbers for any software dependencies used in the experiments.
Experiment Setup | Yes | The reported results are the square root of the average MSE over 100 runs. α is set to 0.01 for RepBM. We collect 1024 trajectories for OPPE. We learn an evaluation policy by fitted Q iteration and use the ε-greedy policy of the optimal Q function as the behavior policy. (See the RMSE sketch after this table.)
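
The data-collection protocol quoted under Open Datasets can be summarized in a short sketch. This is not the authors' code (none is released); it assumes the gymnasium API, the standard CartPole-v1 task as a stand-in for the paper's long- and short-horizon Cart Pole variants, and a hypothetical q_function that returns Q-values for a state.

```python
# Minimal sketch of the described data collection: roll out an epsilon-greedy
# behavior policy (epsilon = 0.2) around a greedy evaluation policy and keep
# 1024 trajectories for off-policy policy evaluation (OPPE).
# `q_function` is a hypothetical stand-in for the Q function learned by
# fitted Q iteration; the gymnasium dependency is an assumption.
import numpy as np
import gymnasium as gym

def collect_trajectories(q_function, n_trajectories=1024, epsilon=0.2, seed=0):
    env = gym.make("CartPole-v1")
    rng = np.random.default_rng(seed)
    dataset = []
    for i in range(n_trajectories):
        obs, _ = env.reset(seed=seed + i)
        trajectory, done = [], False
        while not done:
            if rng.random() < epsilon:
                action = env.action_space.sample()        # explore with prob. epsilon
            else:
                action = int(np.argmax(q_function(obs)))  # greedy w.r.t. the learned Q
            next_obs, reward, terminated, truncated, _ = env.step(action)
            trajectory.append((obs, action, reward, next_obs))
            obs, done = next_obs, terminated or truncated
        dataset.append(trajectory)
    return dataset
```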
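
The error metric quoted under Experiment Setup, the square root of the average MSE over 100 runs, is likewise simple to state in code. The inputs `estimates` and `true_value` are hypothetical, since the paper does not publish its evaluation scripts.

```python
# Minimal sketch of the reported metric: root of the MSE of per-run OPPE
# estimates against the true value of the evaluation policy.
import numpy as np

def root_mean_squared_error(estimates, true_value):
    estimates = np.asarray(estimates, dtype=float)
    return float(np.sqrt(np.mean((estimates - true_value) ** 2)))

# Example: 100 simulated runs of an estimator around a true value of 195.0.
rmse = root_mean_squared_error(np.random.normal(195.0, 5.0, size=100), 195.0)
```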