Breaking the Curse of Horizon: Infinite-Horizon Off-Policy Estimation
Authors: Qiang Liu, Lihong Li, Ziyang Tang, Dengyong Zhou
NeurIPS 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we conduct experiments on different environmental settings to compare our method with existing off-policy evaluation methods. |
| Researcher Affiliation | Collaboration | Qiang Liu The University of Texas at Austin [...] Lihong Li Google Brain [...] Ziyang Tang The University of Texas at Austin [...] Dengyong Zhou Google Brain |
| Pseudocode | No | The paper describes the method mathematically and conceptually but does not include any formal pseudocode or algorithm blocks (a minimal sketch of the core estimator is given after the table). |
| Open Source Code | No | The paper does not provide a direct link or explicit statement about the availability of open-source code for the described methodology. |
| Open Datasets | Yes | We modify the environment to make it infinite horizon, by allowing passengers to randomly appear and disappear at every corner of the map at each time step. We use a grid size of 5 x 5, which yields 2000 states in total (25 x 2^4 x 5, ...). [...] The original taxi environment would stop when the taxi successfully picks up a passenger and drops her off at the right place. [...] SUMO [15] is an open-source traffic simulator |
| Dataset Splits | No | The paper describes generating trajectories for evaluation but does not specify a training/validation/test split in terms of percentages or counts for a fixed dataset. Algorithms are instead evaluated by MSE against the T-step rewards of simulated trajectories. |
| Hardware Specification | No | The paper does not provide any specific hardware details used for running the experiments. |
| Software Dependencies | No | The paper does not provide specific version numbers for any software dependencies. |
| Experiment Setup | Yes | We use a grid size of 5 x 5, which yields 2000 states in total (25 x 2^4 x 5, ...). To construct target and behavior policies for testing our algorithm, we set our target policy π to be the final policy after running Q-learning for 1000 iterations, and set another policy π₊ after 950 iterations. The behavior policy is π_b = (1 − α)π + απ₊, where α is a mixing ratio that can be varied. [...] The number of trajectories is fixed to be 100 in (b,c,d). The potential behavior policy π₊ (the right-most points in (b)) is used in (a,c,d,e). The default values of the parameters, unless varied, are γ = 0.99, n = 200, T = 400. [...] The policy is taken to be a truncated Gaussian whose mean is a neural network of the states and whose variance is a constant. [...] The default parameters are n = 150, T = 1000, γ = 0.99, α = 1. (See the policy-mixing sketch below the table.) |
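Since the paper provides no pseudocode, the following is a minimal sketch of the estimator it describes mathematically: given an estimate w(s) of the stationary density ratio d_π(s)/d_π_b(s), the reward under the target policy is approximated by a self-normalized importance-weighted mean. All names and the policy-evaluation API here are illustrative assumptions; the mini-max procedure the paper uses to fit w is omitted.

```python
import numpy as np

def off_policy_value(states, actions, rewards, w, pi, pi_b):
    """Self-normalized off-policy estimate of the average reward.

    w(s)     -- estimated stationary density ratio d_pi(s) / d_pi_b(s)
                (fitting w via the paper's mini-max objective is omitted)
    pi, pi_b -- target and behavior policies; pi(a, s) returns the
                probability of action a in state s (assumed interface)
    """
    # Per-sample weight: state density ratio times the one-step
    # action-probability ratio between target and behavior policies.
    weights = np.array([w(s) * pi(a, s) / pi_b(a, s)
                        for s, a in zip(states, actions)])
    # Self-normalization keeps the estimate on the reward scale even
    # when w is only known up to a constant factor.
    return float(np.sum(weights * np.asarray(rewards)) / np.sum(weights))
```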
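The experiment setup mixes two Q-learning snapshots into a behavior policy, π_b = (1 − α)π + απ₊. Below is a minimal sketch of that construction for a discrete action space; the function names and the vector-of-probabilities policy representation are assumptions, not the authors' (unreleased) code.

```python
import numpy as np

def mix_policies(pi, pi_plus, alpha):
    """Behavior policy pi_b = (1 - alpha) * pi + alpha * pi_plus.

    pi and pi_plus map a state to a NumPy vector of action
    probabilities (e.g., policies derived from Q-learning snapshots
    at iterations 1000 and 950, as in the paper's taxi experiment).
    alpha in [0, 1] is the mixing ratio that the paper varies.
    """
    def pi_b(state):
        probs = (1.0 - alpha) * pi(state) + alpha * pi_plus(state)
        return probs / probs.sum()  # guard against rounding drift
    return pi_b
```

With this construction, alpha = 0 recovers the target policy itself (the on-policy case), while larger alpha increases the mismatch between behavior and target; alpha = 1 uses π₊ alone, matching the right-most points in the paper's plot (b).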