Breaking the Curse of Horizon: Infinite-Horizon Off-Policy Estimation

Authors: Qiang Liu, Lihong Li, Ziyang Tang, Dengyong Zhou

NeurIPS 2018 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this section, we conduct experiments on different environmental settings to compare our method with existing off-policy evaluation methods.
Researcher Affiliation | Collaboration | Qiang Liu (The University of Texas at Austin) [...] Lihong Li (Google Brain) [...] Ziyang Tang (The University of Texas at Austin) [...] Dengyong Zhou (Google Brain)
Pseudocode | No | The paper describes the method mathematically and conceptually but does not include any formal pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide a direct link or explicit statement about the availability of open-source code for the described methodology.
Open Datasets | Yes | We modify the environment to make it infinite horizon, by allowing passengers to randomly appear and disappear at every corner of the map at each time step. We use a grid size of 5 x 5, which yields 2000 states in total (25 x 2^4 x 5, ...). [...] The original taxi environment would stop when the taxi successfully picks up a passenger and drops her off at the right place. [...] SUMO [15] is an open source traffic simulator. (See the state-count sketch after the table.)
Dataset Splits | No | The paper describes generating trajectories for evaluation but does not specify a training/validation/test split in terms of percentages or counts for a fixed dataset. It evaluates algorithms based on MSE with respect to the T-step rewards of trajectories.
Hardware Specification | No | The paper does not provide any specific hardware details used for running the experiments.
Software Dependencies | No | The paper does not provide specific version numbers for any software dependencies.
Experiment Setup | Yes | We use a grid size of 5 x 5, which yields 2000 states in total (25 x 2^4 x 5, ...). To construct target and behavior policies for testing our algorithm, we set our target policy π to be the final policy after running Q-learning for 1000 iterations, and set another policy π+ to be the policy after 950 iterations. The behavior policy is π_b = (1 − α)π + απ+, where α is a mixing ratio that can be varied. [...] The number of trajectories is fixed to 100 in (b,c,d). The potential behavior policy π+ (the right-most points in (b)) is used in (a,c,d,e). The default values of the parameters, unless otherwise varied, are γ = 0.99, n = 200, T = 400. [...] The policy is taken to be a truncated Gaussian whose mean is a neural network of the states and whose variance is a constant. [...] The default parameters are n = 150, T = 1000, γ = 0.99, α = 1. (See the behavior-policy sketch below.)
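
For the Open Datasets row, the quoted 25 x 2^4 x 5 = 2000 state count can be checked with a few lines of arithmetic. The sketch below is only an illustration: the interpretation of the three factors (taxi position on the 5 x 5 grid, a waiting-passenger flag at each of the 4 corners, and a 5-way taxi/passenger status) is an assumption inferred from the quoted description, not something stated in the excerpt.

```python
# Hypothetical breakdown of the quoted 2000-state count for the modified
# (infinite-horizon) Taxi environment. The meaning of each factor is an
# assumption inferred from the quoted description, not taken from the paper.

GRID_SIZE = 5

taxi_positions = GRID_SIZE * GRID_SIZE   # 25 cells on the 5 x 5 grid
passenger_patterns = 2 ** 4              # a passenger may or may not be waiting at each of the 4 corners
taxi_status = 5                          # assumed: empty, or carrying a passenger bound for one of 4 corners

total_states = taxi_positions * passenger_patterns * taxi_status
print(total_states)                      # 2000, matching the count quoted above
```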
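
For the Experiment Setup row, the behavior-policy construction π_b = (1 − α)π + απ+ can be illustrated with a short sketch. The tabular policy representation, array shapes, and helper names (`mixed_behavior_policy`, `sample_action`) are assumptions for illustration only; the authors' actual policies come from Q-learning and are not reproduced here.

```python
import numpy as np

def mixed_behavior_policy(pi_target: np.ndarray,
                          pi_plus: np.ndarray,
                          alpha: float) -> np.ndarray:
    """Return pi_b = (1 - alpha) * pi_target + alpha * pi_plus.

    Both inputs are assumed to be tabular policies of shape
    (num_states, num_actions) with rows summing to 1, so the convex
    mixture is again a valid policy.
    """
    assert pi_target.shape == pi_plus.shape
    return (1.0 - alpha) * pi_target + alpha * pi_plus

def sample_action(pi_b: np.ndarray, state: int, rng: np.random.Generator) -> int:
    """Sample one action for `state` from the mixed behavior policy."""
    return int(rng.choice(pi_b.shape[1], p=pi_b[state]))

# Illustrative usage with random placeholder policies (the paper's policies
# come from Q-learning after 1000 and 950 iterations, not reproduced here).
rng = np.random.default_rng(0)
num_states, num_actions = 2000, 6        # sizes are placeholders
pi_target = rng.dirichlet(np.ones(num_actions), size=num_states)
pi_plus = rng.dirichlet(np.ones(num_actions), size=num_states)
pi_b = mixed_behavior_policy(pi_target, pi_plus, alpha=0.5)
action = sample_action(pi_b, state=0, rng=rng)
```

Setting alpha = 0 recovers the target policy, while alpha = 1 recovers π+, consistent with the "potential behavior policy" that the quoted setup uses as the right-most point in panel (b).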