Representation Balancing MDPs for Off-policy Policy Evaluation

Authors: Yao Liu, Omer Gottesman, Aniruddh Raghu, Matthieu Komorowski, Aldo A. Faisal, Finale Doshi-Velez, Emma Brunskill

NeurIPS 2018 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate that our resulting models can yield substantially lower mean squared error estimators than prior model-based and IS-based estimators on a classic benchmark RL task (even when the IS-based estimators are given access to the true behavior policy). We also demonstrate our approach can yield improved results on a HIV treatment simulator [6].
Researcher Affiliation | Academia | Yao Liu, Stanford University (yaoliu@stanford.edu); Omer Gottesman, Harvard University (gottesman@fas.harvard.edu); Aniruddh Raghu, Cambridge University (aniruddhraghu@gmail.com); Matthieu Komorowski, Imperial College London (matthieu.komorowski@gmail.com); Aldo Faisal, Imperial College London (a.faisal@imperial.ac.uk); Finale Doshi-Velez, Harvard University (finale@seas.harvard.edu); Emma Brunskill, Stanford University (ebrun@cs.stanford.edu)
Pseudocode | No | The paper describes the algorithm and objective function, but does not include a formal pseudocode block or algorithm figure.
Open Source Code | No | The paper does not state that its source code is publicly available.
Open Datasets | Yes | We test our algorithm on two continuous-state benchmark domains. We use a greedy policy from a learned Q function as the evaluation policy, and an ε-greedy policy with ε = 0.2 as the behavior policy. We collect 1024 trajectories for OPPE. In the Cart Pole domain the average trajectory length is around 190 (long-horizon variant) or around 23 (short-horizon variant); in Mountain Car it is around 150. The HIV simulator is described in Ernst et al. [6]. (See the data-collection sketch after this table.)
Dataset Splits | No | The paper describes data collection from simulation environments but does not specify explicit train/validation/test splits.
Hardware Specification | No | The paper does not describe the hardware used to run the experiments.
Software Dependencies | No | The paper does not give version numbers for any software dependencies used in the experiments.
Experiment Setup | Yes | The reported results are the square root of the average MSE over 100 runs. α is set to 0.01 for RepBM. We collect 1024 trajectories for OPPE. We learn an evaluation policy by fitted Q iteration and use the ε-greedy policy of the optimal Q function as the behavior policy. (See the RMSE sketch after this table.)
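
The data-collection protocol quoted under Open Datasets can be summarized in a short sketch. This is not the authors' code (none is released); it assumes the gymnasium API, the standard CartPole-v1 task as a stand-in for the paper's long- and short-horizon Cart Pole variants, and a hypothetical q_function that returns Q-values for a state.

```python
# Minimal sketch of the described data collection: roll out an epsilon-greedy
# behavior policy (epsilon = 0.2) around a greedy evaluation policy and keep
# 1024 trajectories for off-policy policy evaluation (OPPE).
# `q_function` is a hypothetical stand-in for the Q function learned by
# fitted Q iteration; the gymnasium dependency is an assumption.
import numpy as np
import gymnasium as gym

def collect_trajectories(q_function, n_trajectories=1024, epsilon=0.2, seed=0):
    env = gym.make("CartPole-v1")
    rng = np.random.default_rng(seed)
    dataset = []
    for i in range(n_trajectories):
        obs, _ = env.reset(seed=seed + i)
        trajectory, done = [], False
        while not done:
            if rng.random() < epsilon:
                action = env.action_space.sample()        # explore with prob. epsilon
            else:
                action = int(np.argmax(q_function(obs)))  # greedy w.r.t. the learned Q
            next_obs, reward, terminated, truncated, _ = env.step(action)
            trajectory.append((obs, action, reward, next_obs))
            obs, done = next_obs, terminated or truncated
        dataset.append(trajectory)
    return dataset
```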
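
The error metric quoted under Experiment Setup, the square root of the average MSE over 100 runs, is likewise simple to state in code. The inputs `estimates` and `true_value` are hypothetical, since the paper does not publish its evaluation scripts.

```python
# Minimal sketch of the reported metric: root of the MSE of per-run OPPE
# estimates against the true value of the evaluation policy.
import numpy as np

def root_mean_squared_error(estimates, true_value):
    estimates = np.asarray(estimates, dtype=float)
    return float(np.sqrt(np.mean((estimates - true_value) ** 2)))

# Example: 100 simulated runs of an estimator around a true value of 195.0.
rmse = root_mean_squared_error(np.random.normal(195.0, 5.0, size=100), 195.0)
```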