Taylor Expansion Policy Optimization
Authors: Yunhao Tang, Michal Valko, Remi Munos
ICML 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate the potential benefits of applying second-order expansions in a diverse set of scenarios. In particular, we test if the second-order correction helps with (1) policy-based and (2) value-based algorithms. In large-scale experiments, to take advantage of computational architectures, actors (µ) and learners (π) are not perfectly synchronized. ... Evaluation. All evaluation environments are done on the entire suite of Atari games (Bellemare et al., 2013). We report human-normalized scores for each level, calculated as z_i = (r_i - o_i)/(h_i - o_i), where h_i and o_i are the performances of human and a random policy on level i respectively; with details in Appendix H.2. (A sketch of this score computation appears after the table.) |
| Researcher Affiliation | Collaboration | (1) Columbia University, New York, USA; (2) DeepMind Paris, France. |
| Pseudocode | Yes | Algorithm 1 Tay PO-2: Second-order policy optimization |
| Open Source Code | No | The paper does not provide an unambiguous statement about releasing code or a link to a code repository for the described methodology. |
| Open Datasets | Yes | All evaluation environments are done on the entire suite of Atari games (Bellemare et al., 2013). |
| Dataset Splits | No | The paper mentions training and evaluation but does not provide specific details on training/validation/test dataset splits needed for reproduction. |
| Hardware Specification | No | The paper mentions 'powerful computational architectures' and 'distributed on different host machines' but does not specify exact hardware details such as GPU models, CPU models, or memory. |
| Software Dependencies | No | The paper does not name specific ancillary software with version numbers (e.g., particular libraries or solvers and their versions). |
| Experiment Setup | Yes | The paper provides hyperparameter details for each experiment setup in the respective subsections of Appendix H. For example, Appendix H.5 states: 'hyperparameters follow roughly those used in IMPALA (Espeholt et al., 2018), with learning rate 5e-5 and batch size 64.' (A hypothetical config capturing these values follows the table.) |
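
The human-normalized score quoted in the Research Type row is simple to reproduce. The sketch below assumes per-game agent returns together with human and random-policy baseline scores; the function name, variable names, and example numbers are illustrative and are not taken from the paper.

```python
# Illustrative sketch of the human-normalized score z_i = (r_i - o_i) / (h_i - o_i),
# where h_i is human performance and o_i is random-policy performance on level i.
# The baseline values below are placeholders, not the paper's numbers.

def human_normalized_score(agent_score: float,
                           human_score: float,
                           random_score: float) -> float:
    """Return (r - o) / (h - o) for a single Atari level."""
    return (agent_score - random_score) / (human_score - random_score)

# Hypothetical per-game baselines (human h_i, random o_i) and agent returns r_i.
baselines = {"breakout": (30.5, 1.7), "pong": (14.6, -20.7)}
agent_returns = {"breakout": 410.0, "pong": 20.3}

scores = {
    game: human_normalized_score(agent_returns[game], human, random_)
    for game, (human, random_) in baselines.items()
}
print(scores)  # e.g. {'breakout': ~14.2, 'pong': ~1.16}
```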
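
The Experiment Setup row reports only two concrete hyperparameters (learning rate 5e-5 and batch size 64 from Appendix H.5, following IMPALA). A minimal, hypothetical configuration capturing those two reported values might look as follows; every other field is an assumption for illustration and should be checked against IMPALA (Espeholt et al., 2018) rather than read as the paper's setup.

```python
# Hypothetical training configuration. Only learning_rate and batch_size are
# values reported in the paper (Appendix H.5); all remaining fields are
# placeholder assumptions one would fill in from IMPALA's published settings.
from dataclasses import dataclass

@dataclass
class TrainingConfig:
    learning_rate: float = 5e-5   # reported in Appendix H.5
    batch_size: int = 64          # reported in Appendix H.5
    unroll_length: int = 20       # assumption: a common IMPALA-style choice
    discount: float = 0.99        # assumption
    optimizer: str = "rmsprop"    # assumption

config = TrainingConfig()
print(config)
```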