Provably Efficient Neural GTD for Off-Policy Learning
Authors: Hoi-To Wai, Zhuoran Yang, Zhaoran Wang, Mingyi Hong
NeurIPS 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We perform preliminary experiments to support the above theories on a toy example of off-policy learning. ... In Fig. 1, we compare the average MSBE against the number of neurons m, using a 2-layer, ReLU NN with random initialization according to H1, after T = 3 × 10^5 iterations of neural GTD and neural TD [Cai et al., 2019] run with Markovian samples [cf. Algorithm 1], from 10 independent runs of state/action. |
| Researcher Affiliation | Academia | Hoi-To Wai The Chinese University of Hong Kong ... Zhuoran Yang Princeton University ... Zhaoran Wang Northwestern University ... Mingyi Hong University of Minnesota |
| Pseudocode | Yes | Algorithm 1 Neural GTD algorithms for MSBE (an illustrative GTD2-style sketch appears after the table) |
| Open Source Code | No | The paper does not provide any information about open-source code for the methodology. |
| Open Datasets | No | We consider an MDP taken from the Garnet class with |S| = 500 states, |A| = 5 possible actions per state with uniformly distributed rewards, and the discount factor is γ = 0.9. We generate two random policies with the same support as the behavior/target policies, respectively. This describes a simulated environment setup, not a publicly available dataset with a link or formal citation (a Garnet-style construction is sketched after the table). |
| Dataset Splits | No | The paper describes a simulation environment and runs for a fixed number of iterations (T = 3 × 10^5) and independent runs (10), but does not explicitly mention training, validation, or test dataset splits or percentages. |
| Hardware Specification | No | The paper does not specify any hardware used for running the experiments. |
| Software Dependencies | No | The paper does not specify any software dependencies with version numbers. |
| Experiment Setup | Yes | We consider an MDP taken from the Garnet class with |S| = 500 states, |A| = 5 possible actions per state with uniformly distributed rewards, and the discount factor is γ = 0.9. We generate two random policies with the same support as the behavior/target policies, respectively. In Fig. 1, we compare the average MSBE against the number of neurons m, using a 2-layer, ReLU NN with random initialization according to H1, after T = 3 × 10^5 iterations of neural GTD and neural TD [Cai et al., 2019] run with Markovian samples [cf. Algorithm 1], from 10 independent runs of state/action. |
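
The Garnet environment quoted in the Open Datasets and Experiment Setup rows is simulated rather than downloaded, which is why no public dataset exists. As a rough illustration of what reproducing it would involve, the sketch below builds a Garnet-style MDP with uniformly distributed rewards and two random full-support policies. The branching factor `b` and all function names are assumptions for illustration; the paper does not publish code or report these details.

```python
import numpy as np

def make_garnet(n_states=500, n_actions=5, branching=5, seed=0):
    """Sample a Garnet-style MDP: sparse random transitions, uniform rewards.

    `branching` (the number of reachable next states per state-action pair)
    is an assumption; the paper does not report this parameter.
    """
    rng = np.random.default_rng(seed)
    P = np.zeros((n_states, n_actions, n_states))  # transition kernel
    for s in range(n_states):
        for a in range(n_actions):
            nxt = rng.choice(n_states, size=branching, replace=False)
            P[s, a, nxt] = rng.dirichlet(np.ones(branching))  # random simplex point
    R = rng.uniform(size=(n_states, n_actions))  # uniformly distributed rewards
    return P, R

def random_policy(n_states, n_actions, rng):
    """A random stochastic policy; full support, so behavior and target
    policies generated this way share the same support."""
    logits = rng.uniform(size=(n_states, n_actions))
    return logits / logits.sum(axis=1, keepdims=True)

rng = np.random.default_rng(1)
P, R = make_garnet()
behavior = random_policy(500, 5, rng)  # generates the Markovian samples
target = random_policy(500, 5, rng)    # policy being evaluated (γ = 0.9)
```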
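Algorithm 1 itself is reported only as pseudocode in the paper. To make the quoted setup concrete, below is a minimal sketch of a GTD2-style two-timescale update with a two-layer ReLU value network whose output weights are fixed at initialization, in the spirit of the H1 initialization quoted above. The step sizes, network parameterization, and the use of a plain GTD2 correction term are assumptions, not the paper's exact Algorithm 1.

```python
import numpy as np

class TwoLayerReLU:
    """V(x) = (1/sqrt(m)) * sum_r b_r * relu(w_r . x); only W is trained,
    with the output signs b_r fixed at random initialization."""
    def __init__(self, dim, m, rng):
        self.W = rng.normal(scale=1.0 / np.sqrt(dim), size=(m, dim))
        self.b = rng.choice([-1.0, 1.0], size=m)  # fixed output weights
        self.m = m

    def value(self, x):
        return self.b @ np.maximum(self.W @ x, 0.0) / np.sqrt(self.m)

    def grad(self, x):
        # dV/dW_r = (b_r / sqrt(m)) * 1{w_r . x > 0} * x
        active = (self.W @ x > 0.0).astype(float)
        return (self.b * active)[:, None] * x[None, :] / np.sqrt(self.m)

def gtd2_step(net, u, s, r, s_next, rho, gamma=0.9, alpha=1e-3, beta=1e-2):
    """One GTD2-style update; alpha, beta are illustrative step sizes and
    rho = pi(a|s) / beta(a|s) is the importance ratio for off-policy samples."""
    phi, phi_next = net.grad(s), net.grad(s_next)
    delta = r + gamma * net.value(s_next) - net.value(s)  # TD error
    # Fast timescale: auxiliary weights u track the projected TD error.
    u += beta * rho * (delta - np.sum(phi * u)) * phi
    # Slow timescale: gradient-corrected update of the network weights.
    net.W += alpha * rho * np.sum(phi * u) * (phi - gamma * phi_next)
    return u
```

Running T iterations of `gtd2_step` along a Markovian trajectory drawn from the behavior policy, and averaging the MSBE over 10 independent runs at each width m, would mirror the comparison against neural TD described in the Experiment Setup row.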