Infinite-horizon Off-Policy Policy Evaluation with Multiple Behavior Policies

Authors: Xinyun Chen, Lu Wang, Yizhe Hang, Heng Ge, Hongyuan Zha

ICLR 2020

Reproducibility assessment: each entry below lists the reproducibility variable, the assessed result, and the supporting LLM response.

Research Type: Experimental
LLM Response: "In extensive experiments with both continuous and discrete environments, we demonstrate that our algorithm offers significantly improved accuracy compared to the state-of-the-art methods." (Section 5, Experiment): "In this section, we evaluate EMP on OPPE problems in three discrete-control tasks (Taxi, Singlepath, Gridworld) and one continuous-control task (Pendulum; see Appendix D.1 for the details), in both single-behavior-policy (Section 5.1) and multiple-behavior-policy (Section 5.2) settings, with the following purposes: (i) to compare the performance of EMP with existing OPPE methods; (ii) to validate the theoretical properties of EMP; (iii) to explore potential improvements of EMP for future study." (For orientation, a generic importance-sampling OPPE sketch appears after this assessment.)

Researcher Affiliation: Academia
LLM Response: 1) Institute for Data and Decision Analytics, The Chinese University of Hong Kong, Shenzhen & Shenzhen Institute of Artificial Intelligence and Robotics for Society; 2) Department of Computer Science, East China Normal University; 3) Department of Computer Science, University of Science and Technology of China; 4) School of Mathematics and Statistics, Shandong University; 5) Institute for Data and Decision Analytics, The Chinese University of Hong Kong, Shenzhen & Shenzhen Institute of Artificial Intelligence and Robotics for Society & Georgia Institute of Technology.

Pseudocode: No
LLM Response: The paper describes algorithms and derivations textually and mathematically, but does not include any structured pseudocode or algorithm blocks.

Open Source Code: No
LLM Response: "We will release the codes with the publication of this paper for relevant study."

Open Datasets: Yes
LLM Response: "Taxi (Dietterich, 2000) is a 5×5 grid world simulating a taxi movement. ... Gridworld (Thomas & Brunskill, 2016) is a 4×4 grid world... Pendulum has a continuous state space of R^3..." The paper refers to established environments such as Taxi (Dietterich, 2000) and Gridworld (Thomas & Brunskill, 2016), implying standard, publicly available setups for these environments. (A minimal environment-instantiation sketch appears after this assessment.)

Dataset Splits: No
LLM Response: The paper discusses generating trajectories from policies and evaluating performance, but it does not specify explicit train/validation/test dataset splits with percentages or counts for a fixed dataset. The data is generated dynamically based on the number of trajectories and the truncated length.

Hardware Specification: No
LLM Response: The paper does not provide specific hardware details such as GPU/CPU models, memory, or cloud computing instance types used for running the experiments.

Software Dependencies: No
LLM Response: "For the continuous environment Pendulum, we use a neural network to model the policy. In detail, we train a two-layer MLP neural network to estimate the policy. The size of the two hidden layers is 32, with learning rate 0.001 and the tanh activation function. We use MEL (Eq. 5) and the Adam optimizer to train the neural network with batch size 128. ... We use Q-learning in discrete control tasks and Actor-Critic in continuous control tasks." While software components are mentioned (neural network, MLP, Adam optimizer, Q-learning, Actor-Critic), no specific version numbers for libraries (e.g., PyTorch, TensorFlow) or programming languages are provided.

Experiment Setup: Yes
LLM Response: "For the continuous environment Pendulum, we use a neural network to model the policy. In detail, we train a two-layer MLP neural network to estimate the policy. The size of the two hidden layers is 32, with learning rate 0.001 and the tanh activation function. We use MEL (Eq. 5) and the Adam optimizer to train the neural network with batch size 128." (A hedged sketch of this configuration appears below.)
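
The Research Type entry quotes the paper's experiments on off-policy policy evaluation (OPPE) with single and multiple behavior policies. For orientation only, the sketch below shows the standard trajectory-wise importance-sampling baseline that infinite-horizon methods of this kind are typically contrasted with; it is not the paper's EMP estimator, and all names are illustrative.

```python
# A minimal, illustrative sketch (NOT the paper's EMP estimator): the standard
# trajectory-wise importance-sampling baseline for off-policy policy evaluation,
# written so that each trajectory may come from a different behavior policy.
# Assumes small discrete state/action spaces and known policy tables.
import numpy as np

def is_estimate(trajectories, behavior_policies, pi_target, gamma=0.99):
    """trajectories[i] is a list of (state, action, reward) tuples collected
    under behavior_policies[i]; each policy is an (n_states, n_actions) array."""
    returns = []
    for traj, pi_b in zip(trajectories, behavior_policies):
        weight, ret, discount = 1.0, 0.0, 1.0
        for s, a, r in traj:
            # Cumulative importance weight corrects for acting under pi_b while
            # evaluating pi_target; its variance blows up on long horizons,
            # which is the failure mode infinite-horizon estimators address.
            weight *= pi_target[s, a] / pi_b[s, a]
            ret += discount * r
            discount *= gamma
        returns.append(weight * ret)
    return float(np.mean(returns))
```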
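
The Open Datasets entry names standard benchmark environments. A minimal sketch follows, assuming the publicly available Gymnasium releases of Taxi and Pendulum; the paper does not state which implementations it used, and Singlepath and the 4×4 Gridworld of Thomas & Brunskill (2016) are typically custom-built, so they are not shown.

```python
# Minimal sketch, assuming the standard Gymnasium releases of two of the named
# environments; the environment IDs are assumptions, not taken from the paper.
import gymnasium as gym

taxi = gym.make("Taxi-v3")          # 5x5 grid-world taxi task, discrete control
pendulum = gym.make("Pendulum-v1")  # continuous control, observations in R^3

# Roll out one random step in each environment.
obs, info = taxi.reset(seed=0)
obs, reward, terminated, truncated, info = taxi.step(taxi.action_space.sample())

obs, info = pendulum.reset(seed=0)
obs, reward, terminated, truncated, info = pendulum.step(pendulum.action_space.sample())
```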
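
The Software Dependencies and Experiment Setup entries describe the reported architecture and hyperparameters without naming a framework. Below is a minimal sketch of that configuration, assuming PyTorch; the logged data, the discretized action set, and the loss are placeholders, since the paper's MEL objective (its Eq. 5) is not reproduced here.

```python
# Minimal sketch of the reported setup, assuming PyTorch (framework not stated
# in the paper): a two-layer MLP with 32 tanh units per hidden layer, trained
# with Adam at learning rate 1e-3 and batch size 128.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

class PolicyNet(nn.Module):
    def __init__(self, state_dim=3, n_actions=5, hidden=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, states):
        # Probabilities over a (here hypothetically discretized) action set.
        return torch.softmax(self.net(states), dim=-1)

policy = PolicyNet(state_dim=3, n_actions=5)   # Pendulum observations lie in R^3
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

# Placeholder logged data standing in for (state, action) pairs.
states = torch.randn(1024, 3)
actions = torch.randint(0, 5, (1024,))
loader = DataLoader(TensorDataset(states, actions), batch_size=128, shuffle=True)

for batch_states, batch_actions in loader:
    probs = policy(batch_states)
    # Placeholder negative log-likelihood; the paper optimizes its MEL objective.
    loss = -torch.log(probs.gather(1, batch_actions.unsqueeze(1)) + 1e-8).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```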