Provably Efficient Neural GTD for Off-Policy Learning

Authors: Hoi-To Wai, Zhuoran Yang, Zhaoran Wang, Mingyi Hong

NeurIPS 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We perform preliminary experiments to support the above theories on a toy example of off-policy learning. ... In Fig. 1, we compare the average MSBE against the number of neurons m, using a 2-layer ReLU NN with random initialization according to H1, after T = 3 * 10^5 iterations of neural GTD and neural TD [Cai et al., 2019] run with Markovian samples [cf. Algorithm 1], from 10 independent runs of state/action."
Researcher Affiliation | Academia | "Hoi-To Wai, The Chinese University of Hong Kong ... Zhuoran Yang, Princeton University ... Zhaoran Wang, Northwestern University ... Mingyi Hong, University of Minnesota"
Pseudocode | Yes | "Algorithm 1 Neural GTD algorithms for MSBE" (a hedged sketch of a GTD-style update follows the table)
Open Source Code | No | The paper does not provide any information about open-source code for the methodology.
Open Datasets | No | "We consider an MDP taken from the Garnet class with |S| = 500 states, |A| = 5 possible actions per state with uniformly distributed rewards, and the discount factor is gamma = 0.9. We generate two random policies with the same support as the behavior/target policies, respectively." This describes a simulated environment setup, but not a publicly available dataset with a link or formal citation.
Dataset Splits | No | The paper describes a simulation environment run for a fixed number of iterations (T = 3 * 10^5) with 10 independent runs, but does not explicitly mention training, validation, or test dataset splits or percentages.
Hardware Specification | No | The paper does not specify any hardware used for running the experiments.
Software Dependencies | No | The paper does not specify any software dependencies with version numbers.
Experiment Setup | Yes | "We consider an MDP taken from the Garnet class with |S| = 500 states, |A| = 5 possible actions per state with uniformly distributed rewards, and the discount factor is gamma = 0.9. We generate two random policies with the same support as the behavior/target policies, respectively. In Fig. 1, we compare the average MSBE against the number of neurons m, using a 2-layer ReLU NN with random initialization according to H1, after T = 3 * 10^5 iterations of neural GTD and neural TD [Cai et al., 2019] run with Markovian samples [cf. Algorithm 1], from 10 independent runs of state/action." (a hedged environment sketch follows the table)
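To make the "Open Datasets" and "Experiment Setup" rows concrete, below is a minimal sketch of a Garnet-class MDP matching the quoted description (|S| = 500 states, |A| = 5 actions per state, uniformly distributed rewards, gamma = 0.9, and two random policies). The branching factor, the Dirichlet transition draws, the full-support policies, the seeds, and the names `make_garnet` / `random_policy` are assumptions for illustration; the paper does not specify these details.

import numpy as np

def make_garnet(n_states=500, n_actions=5, branching=5, rng=None):
    """Random Garnet-class MDP: P[s, a] is supported on `branching` next states."""
    rng = rng or np.random.default_rng(0)
    P = np.zeros((n_states, n_actions, n_states))
    for s in range(n_states):
        for a in range(n_actions):
            nxt = rng.choice(n_states, size=branching, replace=False)
            P[s, a, nxt] = rng.dirichlet(np.ones(branching))   # random transition mass
    R = rng.uniform(size=(n_states, n_actions))                # uniformly distributed rewards
    return P, R

def random_policy(n_states, n_actions, rng):
    """Random stochastic policy with full support over the actions (an assumption)."""
    return rng.dirichlet(np.ones(n_actions), size=n_states)

# Behavior / target policies for the off-policy setting, discount factor 0.9.
rng = np.random.default_rng(1)
P, R = make_garnet(rng=rng)
mu = random_policy(500, 5, rng)    # behavior policy
pi = random_policy(500, 5, rng)    # target policy
gamma = 0.9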
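The "Pseudocode" row refers to Algorithm 1 (neural GTD for the MSBE), which is not reproduced in this card. As a rough illustration only, the sketch below transplants the classical off-policy GTD2 update (Sutton et al., 2009) onto the gradient features of a two-layer ReLU network with fixed +/-1 output weights, one common reading of an H1-style random initialization. The class and function names (`TwoLayerReLUValue`, `gtd2_step`), the step sizes, the width m = 100, and the one-hot state encoding are all assumptions; this is not the paper's Algorithm 1.

# Illustrative only: a GTD2-style update on the gradient features of a
# two-layer ReLU value network.  Not the paper's exact algorithm.
import numpy as np

class TwoLayerReLUValue:
    """V(s) = (1/sqrt(m)) * sum_r b_r * relu(W_r . s); b is fixed at +/-1, W is trained."""

    def __init__(self, state_dim, m, rng):
        self.m = m
        self.W = rng.normal(size=(m, state_dim)) / np.sqrt(state_dim)
        self.b = rng.choice(np.array([-1.0, 1.0]), size=m)   # fixed output layer

    def value(self, s):
        return float(self.b @ np.maximum(self.W @ s, 0.0)) / np.sqrt(self.m)

    def grad(self, s):
        # Gradient of V(s) with respect to W: the "features" used in the GTD2 update.
        active = (self.W @ s > 0.0).astype(float)            # ReLU activation pattern
        return (self.b * active)[:, None] * s[None, :] / np.sqrt(self.m)

def gtd2_step(net, w_aux, s, r, s_next, rho, gamma=0.9, alpha=1e-2, beta=1e-2):
    """One off-policy GTD2-style step; rho = pi(a|s) / mu(a|s) corrects the sampling."""
    phi, phi_next = net.grad(s), net.grad(s_next)
    delta = r + gamma * net.value(s_next) - net.value(s)     # TD error
    corr = float(np.sum(phi * w_aux))                        # <phi(s), w_aux>
    w_aux += beta * rho * (delta - corr) * phi               # dual / auxiliary update (in place)
    net.W += alpha * rho * corr * (phi - gamma * phi_next)   # primal update
    return delta

# Minimal usage with one-hot state encodings for a 500-state MDP (an assumption).
rng = np.random.default_rng(0)
net = TwoLayerReLUValue(state_dim=500, m=100, rng=rng)
w_aux = np.zeros_like(net.W)                                 # auxiliary weights start at zero
s, s_next = np.eye(500)[3], np.eye(500)[17]
gtd2_step(net, w_aux, s, r=0.5, s_next=s_next, rho=1.2)

In a Markovian-sample run, `gtd2_step` would be called once per transition drawn from the behavior policy, with `rho` computed from the behavior/target policies of the Garnet sketch above; the averaging over 10 independent runs reported in the paper is outside the scope of this sketch.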