Provably Efficient Neural GTD for Off-Policy Learning

Authors: Hoi-To Wai, Zhuoran Yang, Zhaoran Wang, Mingyi Hong

NeurIPS 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We perform preliminary experiments to support the above theories on a toy example of off-policy learning. ... In Fig. 1, we compare the average MSBE against the number of neurons m, using a 2-layer ReLU NN with random initialization according to H1, after T = 3 * 10^5 iterations of neural GTD and neural TD [Cai et al., 2019] run with Markovian samples [cf. Algorithm 1], from 10 independent runs of state/action."
Researcher Affiliation | Academia | "Hoi-To Wai, The Chinese University of Hong Kong ... Zhuoran Yang, Princeton University ... Zhaoran Wang, Northwestern University ... Mingyi Hong, University of Minnesota"
Pseudocode | Yes | "Algorithm 1 Neural GTD algorithms for MSBE" (a hedged sketch of a GTD-style update follows the table)
Open Source Code | No | The paper does not provide any information about open-source code for the methodology.
Open Datasets | No | "We consider an MDP taken from the Garnet class with |S| = 500 states, |A| = 5 possible actions per state with uniformly distributed rewards, and the discount factor is gamma = 0.9. We generate two random policies with the same support as the behavior/target policies, respectively." This describes a simulated environment setup, but not a publicly available dataset with a link or formal citation.
Dataset Splits | No | The paper describes a simulation environment run for a fixed number of iterations (T = 3 * 10^5) with 10 independent runs, but does not explicitly mention training, validation, or test dataset splits or percentages.
Hardware Specification | No | The paper does not specify any hardware used for running the experiments.
Software Dependencies | No | The paper does not specify any software dependencies with version numbers.
Experiment Setup | Yes | "We consider an MDP taken from the Garnet class with |S| = 500 states, |A| = 5 possible actions per state with uniformly distributed rewards, and the discount factor is gamma = 0.9. We generate two random policies with the same support as the behavior/target policies, respectively. In Fig. 1, we compare the average MSBE against the number of neurons m, using a 2-layer ReLU NN with random initialization according to H1, after T = 3 * 10^5 iterations of neural GTD and neural TD [Cai et al., 2019] run with Markovian samples [cf. Algorithm 1], from 10 independent runs of state/action." (a hedged environment sketch follows the table)
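To make the "Open Datasets" and "Experiment Setup" rows concrete, below is a minimal sketch of a Garnet-class MDP matching the quoted description (|S| = 500 states, |A| = 5 actions per state, uniformly distributed rewards, gamma = 0.9, and two random policies). The branching factor, the Dirichlet transition draws, the full-support policies, the seeds, and the names `make_garnet` / `random_policy` are assumptions for illustration; the paper does not specify these details.

import numpy as np

def make_garnet(n_states=500, n_actions=5, branching=5, rng=None):
    """Random Garnet-class MDP: P[s, a] is supported on `branching` next states."""
    rng = rng or np.random.default_rng(0)
    P = np.zeros((n_states, n_actions, n_states))
    for s in range(n_states):
        for a in range(n_actions):
            nxt = rng.choice(n_states, size=branching, replace=False)
            P[s, a, nxt] = rng.dirichlet(np.ones(branching))   # random transition mass
    R = rng.uniform(size=(n_states, n_actions))                # uniformly distributed rewards
    return P, R

def random_policy(n_states, n_actions, rng):
    """Random stochastic policy with full support over the actions (an assumption)."""
    return rng.dirichlet(np.ones(n_actions), size=n_states)

# Behavior / target policies for the off-policy setting, discount factor 0.9.
rng = np.random.default_rng(1)
P, R = make_garnet(rng=rng)
mu = random_policy(500, 5, rng)    # behavior policy
pi = random_policy(500, 5, rng)    # target policy
gamma = 0.9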
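The "Pseudocode" row refers to Algorithm 1 (neural GTD for the MSBE), which is not reproduced in this card. As a rough illustration only, the sketch below transplants the classical off-policy GTD2 update (Sutton et al., 2009) onto the gradient features of a two-layer ReLU network with fixed +/-1 output weights, one common reading of an H1-style random initialization. The class and function names (`TwoLayerReLUValue`, `gtd2_step`), the step sizes, the width m = 100, and the one-hot state encoding are all assumptions; this is not the paper's Algorithm 1.

# Illustrative only: a GTD2-style update on the gradient features of a
# two-layer ReLU value network.  Not the paper's exact algorithm.
import numpy as np

class TwoLayerReLUValue:
    """V(s) = (1/sqrt(m)) * sum_r b_r * relu(W_r . s); b is fixed at +/-1, W is trained."""

    def __init__(self, state_dim, m, rng):
        self.m = m
        self.W = rng.normal(size=(m, state_dim)) / np.sqrt(state_dim)
        self.b = rng.choice(np.array([-1.0, 1.0]), size=m)   # fixed output layer

    def value(self, s):
        return float(self.b @ np.maximum(self.W @ s, 0.0)) / np.sqrt(self.m)

    def grad(self, s):
        # Gradient of V(s) with respect to W: the "features" used in the GTD2 update.
        active = (self.W @ s > 0.0).astype(float)            # ReLU activation pattern
        return (self.b * active)[:, None] * s[None, :] / np.sqrt(self.m)

def gtd2_step(net, w_aux, s, r, s_next, rho, gamma=0.9, alpha=1e-2, beta=1e-2):
    """One off-policy GTD2-style step; rho = pi(a|s) / mu(a|s) corrects the sampling."""
    phi, phi_next = net.grad(s), net.grad(s_next)
    delta = r + gamma * net.value(s_next) - net.value(s)     # TD error
    corr = float(np.sum(phi * w_aux))                        # <phi(s), w_aux>
    w_aux += beta * rho * (delta - corr) * phi               # dual / auxiliary update (in place)
    net.W += alpha * rho * corr * (phi - gamma * phi_next)   # primal update
    return delta

# Minimal usage with one-hot state encodings for a 500-state MDP (an assumption).
rng = np.random.default_rng(0)
net = TwoLayerReLUValue(state_dim=500, m=100, rng=rng)
w_aux = np.zeros_like(net.W)                                 # auxiliary weights start at zero
s, s_next = np.eye(500)[3], np.eye(500)[17]
gtd2_step(net, w_aux, s, r=0.5, s_next=s_next, rho=1.2)

In a Markovian-sample run, `gtd2_step` would be called once per transition drawn from the behavior policy, with `rho` computed from the behavior/target policies of the Garnet sketch above; the averaging over 10 independent runs reported in the paper is outside the scope of this sketch.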