Policy Optimization via Importance Sampling
Authors: Alberto Maria Metelli, Matteo Papini, Francesco Faccio, Marcello Restelli
NeurIPS 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Finally, the algorithm is tested on a selection of continuous control tasks, with both linear and deep policies, and compared with state-of-the-art policy optimization methods. |
| Researcher Affiliation | Academia | Alberto Maria Metelli Politecnico di Milano, Milan, Italy albertomaria.metelli@polimi.it; Matteo Papini Politecnico di Milano, Milan, Italy matteo.papini@polimi.it; Francesco Faccio Politecnico di Milano, Milan, Italy IDSIA, USI-SUPSI, Lugano, Switzerland francesco.faccio@mail.polimi.it; Marcello Restelli Politecnico di Milano, Milan, Italy marcello.restelli@polimi.it |
| Pseudocode | Yes | The pseudo-code of POIS is reported in Algorithm 1 (Algorithm 2 is also provided). |
| Open Source Code | Yes | The implementation of POIS can be found at https://github.com/T3p/pois. |
| Open Datasets | Yes | ...on classical control tasks [12, 57]. (Reference [12] is "Benchmarking deep reinforcement learning for continuous control" which uses standard environments.) |
| Dataset Splits | No | The paper describes using a "current policy" to collect trajectories for optimization, and performing "offline optimization". It does not explicitly mention fixed training, validation, or test dataset splits with percentages or counts, as is common in supervised learning contexts. |
| Hardware Specification | Yes | We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Tesla K40c, Titan Xp and Tesla V100 used for this research. |
| Software Dependencies | No | The paper does not specify versions for any software dependencies, such as programming languages, libraries, or frameworks used for implementation. |
| Experiment Setup | Yes | All experimental details are provided in Appendix F. Appendix F.1 states: "For linear policies we used a learning rate α = 0.001 and a batch size N = 20 trajectories." Appendix F.2 specifies the same network architecture for all environments: three hidden layers with 100, 50, and 25 neurons. (An illustrative sketch assembling these details follows the table.) |
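
For illustration only, the sketch below assembles the quoted setup details (a policy network with 100, 50, and 25 hidden neurons, learning rate 0.001, batch size of 20 trajectories) into a minimal PyTorch snippet. All class and variable names are assumptions; this is not the authors' implementation (available at https://github.com/T3p/pois), and the placeholder update shown is not the POIS surrogate objective from Algorithm 1.

```python
# Hypothetical sketch of the deep policy described in Appendix F.2
# (3 hidden layers with 100, 50, and 25 neurons). Names and activation
# choices are illustrative assumptions, not taken from the authors' code.
import torch
import torch.nn as nn


class GaussianMLPPolicy(nn.Module):
    """Gaussian policy whose mean comes from a 100-50-25 MLP (tanh activations assumed)."""

    def __init__(self, obs_dim: int, act_dim: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(obs_dim, 100), nn.Tanh(),
            nn.Linear(100, 50), nn.Tanh(),
            nn.Linear(50, 25), nn.Tanh(),
        )
        self.mean_head = nn.Linear(25, act_dim)
        # State-independent log standard deviation (a common choice; an assumption here).
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def forward(self, obs: torch.Tensor) -> torch.distributions.Normal:
        mean = self.mean_head(self.body(obs))
        return torch.distributions.Normal(mean, self.log_std.exp())


if __name__ == "__main__":
    policy = GaussianMLPPolicy(obs_dim=8, act_dim=2)
    # The quoted learning rate (0.001) and batch size (20 trajectories) refer to the
    # linear-policy experiments; they are reused here only to make the sketch concrete.
    optimizer = torch.optim.Adam(policy.parameters(), lr=0.001)
    batch_size_trajectories = 20

    obs = torch.randn(batch_size_trajectories, 8)
    actions = policy(obs).sample()

    # Placeholder objective (NOT the POIS surrogate): maximize log-likelihood of the
    # sampled actions, just to show the hyperparameters in a runnable update.
    loss = -policy(obs).log_prob(actions).sum(-1).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    print(actions.shape)  # torch.Size([20, 2])
```

The POIS algorithm itself performs offline optimization of an importance-sampling surrogate (Algorithm 1 in the paper); the generic optimizer above is only a stand-in to make the quoted hyperparameters concrete.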