Ordering-based Conditions for Global Convergence of Policy Gradient Methods
Authors: Jincheng Mei, Bo Dai, Alekh Agarwal, Mohammad Ghavamzadeh, Csaba Szepesvári, Dale Schuurmans
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We provide experimental results to support these theoretical findings. |
| Researcher Affiliation | Collaboration | Jincheng Mei, Google DeepMind (jcmei@google.com); Bo Dai, Google DeepMind (bodai@google.com); Alekh Agarwal, Google Research (alekhagarwal@google.com); Mohammad Ghavamzadeh, Amazon (ghavamza@amazon.com); Csaba Szepesvári, Google DeepMind and University of Alberta (szepi@google.com); Dale Schuurmans, Google DeepMind and University of Alberta (daes@ualberta.ca) |
| Pseudocode | Yes | Algorithm 1 Softmax policy gradient (PG) [...] Algorithm 2 Natural policy gradient (NPG) |
| Open Source Code | No | The paper does not provide any links to open-source code for the described methodology or explicitly state that code is released. |
| Open Datasets | No | The paper uses custom-defined examples (Example 1, 2, 3, 4, 5) with specific matrices and reward vectors, but these are not publicly available datasets in the conventional sense, nor are any links or citations provided for their access. |
| Dataset Splits | No | The paper does not provide explicit training/validation/test splits for the small, custom-defined examples used in the simulations. |
| Hardware Specification | No | The paper does not specify any hardware details (e.g., GPU/CPU models, memory) used for running the simulations or experiments. |
| Software Dependencies | No | The paper does not list specific software dependencies with version numbers used for the experiments. |
| Experiment Setup | Yes | We run Softmax PG and NPG on Example 1 with the same θ1 = (6, 8)^T ∈ R^2. In Figure 1(a), the optimization trajectories show 85 iterations of NPG and 8.5 × 10^6 iterations of Softmax PG, both with learning rate η = 0.2. [...] The initialization is θ1 = (4, 10)^T, and η = 0.2. We run 150 iterations for NPG and 1.5 × 10^7 iterations for Softmax PG. [...] The initialization is θ1 = (10, 2)^T, and η = 0.2. We run 100 iterations for NPG and 2 × 10^6 iterations for Softmax PG. |
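
The experiment-setup row above describes running the paper's Algorithm 1 (Softmax PG) and Algorithm 2 (NPG) on small bandit examples with learning rate η = 0.2. A minimal sketch of both updates on a 2-action bandit is given below; the reward vector `r` is an assumed stand-in (Example 1's actual rewards are not reproduced in this report), and the iteration count is illustrative rather than taken from the paper.

```python
import numpy as np

def softmax(theta):
    """Numerically stabilized softmax policy over actions."""
    z = theta - theta.max()
    e = np.exp(z)
    return e / e.sum()

def softmax_pg_step(theta, r, eta):
    """Softmax PG step: exact gradient of the expected reward pi . r,
    where d(pi . r)/d(theta_a) = pi_a * (r_a - pi . r)."""
    pi = softmax(theta)
    return theta + eta * pi * (r - pi @ r)

def npg_step(theta, r, eta):
    """NPG step: preconditioning by the Fisher pseudoinverse reduces
    the update to the advantage r_a - pi . r."""
    pi = softmax(theta)
    return theta + eta * (r - pi @ r)

r = np.array([1.0, 0.8])                 # assumed rewards (not from the paper)
eta = 0.2                                # learning rate used in the paper
theta_pg = np.array([6.0, 8.0])          # theta_1 from Figure 1(a)
theta_npg = np.array([6.0, 8.0])

for _ in range(2000):
    theta_pg = softmax_pg_step(theta_pg, r, eta)
    theta_npg = npg_step(theta_npg, r, eta)

print(softmax(theta_npg) @ r)            # NPG reaches near-optimal reward
print(softmax(theta_pg) @ r)             # Softmax PG improves more slowly
```

This toy run illustrates the qualitative gap the paper reports: NPG needs orders of magnitude fewer iterations than Softmax PG, whose gradient vanishes quadratically as the policy approaches a deterministic corner.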